Last updated on April 17, 2025. This conference program is tentative and subject to change.
Technical Program for Wednesday May 21, 2025
|
WeAT2 |
301 |
SLAM 3 |
Regular Session |
Chair: Civera, Javier | Universidad De Zaragoza |
Co-Chair: Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
|
08:30-08:35, Paper WeAT2.1 | |
JPG-SLAM: Joint Point-Gaussian Splatting Representation for Dense Dynamic SLAM |
|
Huang, Kunrui | Wuhan University |
Yang, Wennan | Wuhan University |
Zhou, Pengwei | Affiliation (University, Organization, Company)* |
Li, Li | Wuhan University |
Yao, Jian | Wuhan University |
Keywords: SLAM, RGB-D Perception
Abstract: This paper presents a simultaneous localization and mapping (SLAM) system to provide accurate pose estimation and dynamic scene reconstruction. Our approach proposes a Joint Point-Gaussian Splatting representation, which fully integrates the robustness of isotropic feature points in pose estimation and the flexibility of anisotropic 3D Gaussians in scene representation. This system does not need to suppress the anisotropic representation of Gaussian elements, which enables the mapping module to achieve finer scene representation with lower memory consumption. Additionally, in order to enhance the adaptability of the system in dynamic environments, we introduced a dynamic region recognition module and utilized 3D Gaussian Splatting and 4D Gaussian Splatting representations to represent static and dynamic regions respectively. Furthermore, we developed a local map management strategy for Gaussian Splatting mapping, effectively reducing the memory and computational resource usage in the mapping process. Experiments on public datasets demonstrate that our system achieves state-of-the-art tracking and mapping accuracy compared to existing baselines.
|
|
08:35-08:40, Paper WeAT2.2 | |
FMCW-LIO: A Doppler LiDAR-Inertial Odometry |
|
Zhao, Mingle | University of Macau |
Wang, Jiahao | University of Macau |
Gao, Tianxiao | University of Macao |
Xu, Chengzhong | University of Macau |
Kong, Hui | University of Macau |
Keywords: Sensor Fusion, Localization, SLAM
Abstract: Conventional LiDAR-inertial odometry (LIO) or SLAM methods rely heavily on geometric features of environments, as LiDARs primarily provide range measurements rather than motion measurements. This situation changes, however, with the advent of novel Frequency Modulated Continuous Wave (FMCW) LiDARs, which not only offer high-resolution point ranges but also capture the instantaneous per-point Doppler velocity through the Doppler effect. In this letter, we propose FMCW-LIO, a novel and robust LIO that leverages the intrinsic Doppler measurements from FMCW LiDARs. To correctly exploit Doppler velocities, a motion compensation method is designed, and a Doppler-aided observation model is applied for on-manifold state estimation. Dynamic points can then be effectively removed by the Doppler criteria, yielding more consistent geometric observations. FMCW-LIO thus achieves accurate state estimation and static mapping, even in structure-degenerated environments. Extensive experiments in diverse scenes show that FMCW-LIO outperforms other algorithms in both accuracy and robustness.
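For readers unfamiliar with how per-point Doppler velocities constrain ego-motion, the sketch below illustrates the standard least-squares relation for a static scene: an FMCW LiDAR's radial velocity reading for a static point equals the negative projection of the sensor velocity onto the point's bearing. This is a generic illustration with made-up numbers and threshold, not the authors' observation model.

```python
import numpy as np

def estimate_ego_velocity(points, doppler, thresh=0.5):
    """Least-squares ego-velocity from per-point Doppler measurements.

    points  : (N, 3) point positions in the sensor frame [m]
    doppler : (N,)   measured radial (Doppler) velocities [m/s]
    thresh  : residual threshold [m/s] used to flag likely dynamic points

    For a static point p, the expected Doppler reading is
        d = -(p / ||p||) . v_sensor,
    so stacking the unit bearings gives a linear system in v_sensor.
    """
    bearings = points / np.linalg.norm(points, axis=1, keepdims=True)
    A, b = -bearings, doppler
    v_est, *_ = np.linalg.lstsq(A, b, rcond=None)
    residuals = np.abs(A @ v_est - b)
    static_mask = residuals < thresh          # points consistent with ego-motion
    return v_est, static_mask

# Toy example: sensor moving at 1 m/s along x, with one dynamic outlier.
rng = np.random.default_rng(0)
pts = rng.uniform(-10, 10, size=(100, 3))
v_true = np.array([1.0, 0.0, 0.0])
dop = -(pts / np.linalg.norm(pts, axis=1, keepdims=True)) @ v_true
dop[0] += 3.0                                 # a moving object corrupts one return
v_hat, mask = estimate_ego_velocity(pts, dop)
print(v_hat, mask[0])                         # ~[1, 0, 0]; False for the outlier
```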
|
|
08:40-08:45, Paper WeAT2.3 | |
Submodular Optimization for Keyframe Selection & Usage in SLAM |
|
Thorne, David | University of California, Los Angeles |
Chan, Nathan | University of California, Los Angeles |
Ma, Yanlong | University of California, Los Angeles |
Robison, Christopher, Christa | Army Research Laboratory |
Osteen, Philip | U.S. Army Research Laboratory |
Lopez, Brett | University of California, Los Angeles |
Keywords: SLAM, Optimization and Optimal Control, Field Robots
Abstract: Keyframes are LiDAR scans saved for future reference in Simultaneous Localization And Mapping (SLAM), but despite their central importance most algorithms leave choices of which scans to save and how to use them to wasteful heuristics. This work proposes two novel keyframe selection strategies for localization and map summarization, as well as a novel approach to submap generation which selects keyframes that best constrain localization. Our results show that online keyframe identification and submap generation reduce the number of saved keyframes and improve per scan computation time without compromising localization performance. We also present a map summarization feature for quickly capturing environments under strict map size constraints.
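As a point of reference for the submodular-selection idea, the sketch below runs the standard greedy algorithm (which carries the usual 1 - 1/e guarantee for monotone submodular objectives) on a toy coverage objective. The coverage sets and budget are invented for illustration; this is not the paper's selection criterion.

```python
def greedy_select(candidates, coverage, budget):
    """Greedy maximization of a set-coverage objective.

    candidates : list of keyframe ids
    coverage   : dict mapping keyframe id -> set of covered map cells
    budget     : maximum number of keyframes to keep
    """
    selected, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for k in candidates:
            if k in selected:
                continue
            gain = len(coverage[k] - covered)   # marginal gain of adding k
            if gain > best_gain:
                best, best_gain = k, gain
        if best is None:                        # no remaining keyframe adds coverage
            break
        selected.append(best)
        covered |= coverage[best]
    return selected, covered

# Toy example with hand-made coverage sets.
cov = {0: {1, 2, 3}, 1: {3, 4}, 2: {5, 6, 7, 8}, 3: {1, 5}}
keep, cells = greedy_select(list(cov), cov, budget=2)
print(keep, len(cells))   # picks the two keyframes with the largest marginal coverage
```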
|
|
08:45-08:50, Paper WeAT2.4 | |
Equivariant Filter Design for Range-Only SLAM |
|
Ge, Yixiao | Australian National University |
Pearce, Arthur | Australian National University |
van Goor, Pieter | University of Twente |
Mahony, Robert | Australian National University |
Keywords: SLAM, Range Sensing, Mapping
Abstract: Range-only Simultaneous Localisation and Mapping (RO-SLAM) is of interest in the robotics community due to its practical applications; for example, ultra-wideband (UWB) and Bluetooth Low Energy (BLE) localisation in terrestrial and aerial applications and acoustic beacon localisation in marine applications. In this work, we consider a mobile robot equipped with an inertial measurement unit (IMU) and a range sensor that measures distances to a collection of fixed landmarks. We derive an equivariant filter (EqF) for the RO-SLAM problem based on a symmetry Lie group that is compatible with the range measurements. The proposed filter does not require bootstrapping or initialisation of landmark positions, and demonstrates robustness to the no-prior situation. The filter is demonstrated on a real-world dataset, and it is shown to significantly outperform a state-of-the-art EKF alternative in terms of both accuracy and robustness.
|
|
08:50-08:55, Paper WeAT2.5 | |
Toward Globally Optimal State Estimation Using Automatically Tightened Semidefinite Relaxations |
|
Dümbgen, Frederike | ENS, PSL University |
Holmes, Connor | University of Toronto |
Agro, Ben | University of Toronto |
Barfoot, Timothy | University of Toronto |
Keywords: Optimization and Optimal Control, Localization, Robot Safety, Global Optimality
Abstract: In recent years, semidefinite relaxations of common optimization problems in robotics have attracted growing attention due to their ability to provide globally optimal solutions. In many cases, it was shown that specific handcrafted redundant constraints are required to obtain tight relaxations, and thus global optimality. These constraints are formulation-dependent and typically identified through a lengthy manual process. Instead, the present article suggests an automatic method to find a set of redundant constraints sufficient to obtain tightness, if they exist. We first propose an efficient feasibility check to determine if a given set of variables can lead to a tight formulation. Second, we show how to scale the method to larger problems. At no point in the process do we have to find redundant constraints manually. We showcase the effectiveness of the approach, in simulation and on real datasets, for range-based localization and stereo-based pose estimation. We also reproduce semidefinite relaxations presented in recent literature and show that our automatic method always finds a smaller set of constraints sufficient for tightness than previously considered.
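"Tightness" here refers to the rank of the optimal semidefinite matrix: a (numerically) rank-one solution certifies that the relaxation recovers the global optimum of the original problem. The snippet below is a minimal, generic illustration of that check on a toy quadratically constrained problem using cvxpy; it does not reproduce the paper's automatic constraint-generation procedure.

```python
import cvxpy as cp
import numpy as np

# Toy QCQP: minimize x^T Q x subject to ||x||^2 = 1 (smallest-eigenvalue problem).
np.random.seed(0)
A = np.random.randn(4, 4)
Q = A + A.T

# Shor relaxation: lift x x^T -> X (PSD), drop the rank-one constraint.
X = cp.Variable((4, 4), PSD=True)
prob = cp.Problem(cp.Minimize(cp.trace(Q @ X)), [cp.trace(X) == 1])
prob.solve()

# Tightness check: the relaxation is tight iff the optimal X is rank one.
eigvals = np.linalg.eigvalsh(X.value)
rank = int(np.sum(eigvals > 1e-6 * eigvals.max()))
print("relaxation value:", prob.value, "rank:", rank)
# Here rank == 1 and prob.value matches the smallest eigenvalue of Q.
```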
|
|
08:55-09:00, Paper WeAT2.6 | |
Viewpoint-Aware Visibility Scoring for Point Cloud Registration in Loop Closure |
|
Yoon, Ilseung | Ulsan National Institute of Science and Technology |
Islam, Tariq | Ulsan National Institute of Science and Technology |
Kim, Kwangrok | Ulsan National Institute of Science and Technology |
Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
Keywords: SLAM, Autonomous Vehicle Navigation, Mapping
Abstract: Lidar-based Simultaneous Localization and Mapping (SLAM) encounters a substantial challenge in the form of accumulating errors, which can adversely impact its reliability. Loop closing techniques have been extensively employed to counteract this issue. Nonetheless, the loop closing problem remains difficult to resolve, as point clouds often exhibit only partial overlap due to disparities in scanning pose (viewpoint). This renders conventional point cloud registration, such as the Iterative Closest Point (ICP) algorithm, problematic. To overcome this challenge, this paper proposes a two-stage viewpoint-aware point cloud registration technique that assigns suitable weights to the correspondence pairs associating two point clouds from different viewpoints. The weights account for the visibility of points from their respective viewpoint as well as from the viewpoint of the counterpart point cloud, making the registration rely more on points that are commonly visible from both viewpoints. Experimental results on the KITTI and Apollo-SouthBay datasets indicate that the proposed technique delivers more precise and robust performance compared to the baseline techniques.
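To make the role of per-correspondence weights concrete, the snippet below solves a single weighted point-to-point alignment step with the weighted Kabsch/SVD solution, the building block inside a weighted ICP loop. The visibility weights are simply an input here; the paper's visibility-scoring scheme is not reproduced.

```python
import numpy as np

def weighted_rigid_align(src, dst, w):
    """One weighted point-to-point alignment step (Kabsch/SVD).

    src, dst : (N, 3) corresponding points; w : (N,) non-negative weights
    Returns R, t minimizing sum_i w_i * ||R @ src_i + t - dst_i||^2.
    """
    w = w / w.sum()
    mu_s, mu_d = w @ src, w @ dst                        # weighted centroids
    S = (src - mu_s).T @ np.diag(w) @ (dst - mu_d)       # weighted cross-covariance
    U, _, Vt = np.linalg.svd(S)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Toy example: recover a known rotation; down-weight one corrupted correspondence.
rng = np.random.default_rng(1)
src = rng.normal(size=(50, 3))
angle = np.deg2rad(20.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
dst = src @ R_true.T + np.array([0.5, -0.2, 0.1])
dst[0] += 5.0                          # bad correspondence (e.g., occluded point)
weights = np.ones(50)
weights[0] = 1e-3                      # low visibility weight suppresses its influence
R_est, t_est = weighted_rigid_align(src, dst, weights)
print(np.allclose(R_est, R_true, atol=1e-2))
```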
|
|
WeAT3 |
303 |
Mechanism Design 1 |
Regular Session |
Chair: Whitney, John Peter | Northeastern University |
Co-Chair: Herneth, Christopher | Technical University Munich |
|
08:30-08:35, Paper WeAT3.1 | |
Tension Dependent Twisted String Actuator Modelling and Efficacy Benchmarking in Force and Impedance Control |
|
Herneth, Christopher | Technical University Munich |
Cheng, Yi | Technical University of Munich |
Ganguly, Amartya | Technical University of Munich |
Haddadin, Sami | Technical University of Munich |
Keywords: Actuation and Joint Mechanisms, Force Control, Tendon/Wire Mechanism
Abstract: This study presents a comprehensive experimental analysis of Twisted String Actuators (TSA), focused on enhancing contraction modelling accuracy and establishing a baseline for TSA tension and impedance control efficacy. A novel TSA string radius function is introduced, computing effective radii for multi-strand bundles based on axial actuator tension. The proposed model was validated in physical experiments, resulting in a reduction of maximal errors between measured and simulated actuator contraction trajectories from up to 60% in established models to around 10% in our work. Additionally, the tension-dependent radius modification effectively reduced errors between the estimated and the measured bundle tension by an order of magnitude, marking an essential step towards TSA control independent of bundle tension measurements. TSA tension control was assessed based on four metrics: accuracy, precision, impact stability, and bandwidth, following ISO 9283:1998 standards. The quality of tension control was found to be dependent on bundle tension, twisting angle and strand quantity, whereas impact stability was maintained in all configurations. Joint impedance control with TSA was evaluated for perturbation stability and position control bandwidth, where the latter was enhanced with increasing joint stiffness. The presented analysis informs designers about the capabilities of TSAs in different configurations, and their respective suitability for desired applications.
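For context, the widely used baseline TSA kinematics treat the twisted bundle as a helix of fixed radius, giving the contraction x(θ) = L - sqrt(L² - θ²r²); the paper's contribution is to replace the constant radius with a tension-dependent one. The sketch below implements the baseline model and leaves the radius function as a pluggable argument; the linear radius_of_tension used in the example is a made-up placeholder, not the authors' identified function.

```python
import numpy as np

def tsa_contraction(theta, length, radius):
    """Baseline twisted string actuator kinematics (helix model).

    theta  : motor twist angle [rad]
    length : untwisted string length [m]
    radius : effective string/bundle radius [m]
    Returns the axial contraction x = L - sqrt(L^2 - theta^2 r^2) [m].
    """
    return length - np.sqrt(length**2 - (theta * radius) ** 2)

def radius_of_tension(tension, r0=0.6e-3, k=-2.0e-6):
    """Placeholder tension-dependent effective radius (illustrative only)."""
    return r0 + k * tension            # e.g., the bundle compacts slightly under load

L = 0.30                               # 30 cm string (example value)
theta = np.linspace(0.0, 80.0, 5)      # twist angles [rad]
for T in (5.0, 50.0):                  # axial tension [N]
    x = tsa_contraction(theta, L, radius_of_tension(T))
    print(f"T = {T:4.1f} N, contraction [mm]:", np.round(1e3 * x, 2))
```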
|
|
08:35-08:40, Paper WeAT3.2 | |
A Novel Twisted-Winching String Actuator for Robotic Applications: Design and Validation |
|
Poon, Ryan | Massachusetts Institute of Technology |
Padia, Vineet | MIT |
Hunter, Ian | MIT |
Keywords: Tendon/Wire Mechanism, Mechanism Design, Actuation and Joint Mechanisms
Abstract: This paper presents a novel actuator system combining a twisted string actuator (TSA) with a winch mechanism. Relative to traditional hydraulic and pneumatic systems in robotics, TSAs are compact and lightweight but face limitations in stroke length and force-transmission ratios. Our integrated TSA-winch system overcomes these constraints by providing variable transmission ratios through dynamic adjustment. It increases actuator stroke by winching instead of overtwisting, and it improves force output by twisting. The design features a rotating turret that houses a winch, which is mounted on a bevel gear assembly driven by a through-hole drive shaft. Mathematical models are developed for the combined displacement and velocity control of this system. Experimental validation demonstrates the actuator's ability to achieve a wide range of transmission ratios and precise movement control. We present performance data on movement precision and generated forces, discussing the results in the context of existing literature. This research contributes to the development of more versatile and efficient actuation systems for advanced robotic applications and improved automation solutions.
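A minimal way to see why combining twisting and winching extends both stroke and force range is to write the output displacement as the sum of the two contributions, each with its own configuration-dependent transmission ratio. The sketch below uses the textbook helix model for the twisted stage and an ideal drum for the winch; the dimensions are invented and this is not the paper's model.

```python
import numpy as np

def combined_stroke(theta_twist, theta_winch, L, r_string, r_drum):
    """Displacement of a twisted-string stage in series with a winch drum.

    theta_twist : twist angle of the string bundle [rad]
    theta_winch : winch drum rotation [rad]
    L, r_string : untwisted string length and effective radius [m]
    r_drum      : winch drum radius [m]
    """
    x_twist = L - np.sqrt(L**2 - (theta_twist * r_string) ** 2)
    x_winch = r_drum * theta_winch
    return x_twist + x_winch

def transmission_ratios(theta_twist, L, r_string, r_drum):
    """Instantaneous displacement per radian (dx/dtheta) for each stage."""
    dx_twist = (theta_twist * r_string**2) / np.sqrt(L**2 - (theta_twist * r_string) ** 2)
    return dx_twist, r_drum    # twisting ratio grows with twist; the winch ratio is constant

# Illustrative numbers only.
L, r_s, r_d = 0.25, 0.5e-3, 8e-3
print(combined_stroke(60.0, 4 * np.pi, L, r_s, r_d))       # total stroke [m]
print(transmission_ratios(60.0, L, r_s, r_d))              # [m/rad] per stage
```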
|
|
08:40-08:45, Paper WeAT3.3 | |
Design and Evaluation of High-Performance Motion-Decoupled Cable Transmission Modules |
|
Takei, Ryo | Northeastern University |
Frishman, Samuel | Stanford University |
Whitney, John Peter | Northeastern University |
Keywords: Tendon/Wire Mechanism, Actuation and Joint Mechanisms, Medical Robots and Systems
Abstract: Cable transmissions are commonly used in robotics for remote force transmission, offering a lightweight, compact, and efficient solution for transmitting high forces between input and output. However, cables in flexible compression housings (Bowden cables) exhibit high static friction, which increases exponentially with total bend angle. Alternatively, internally routed, ball-bearing-supported cable capstan transmissions are low friction but complex, and they present challenges in routing multiple sets of cables. In this paper, we propose motion-decoupled cable transmission modules that address these challenges, occupying the middle ground and functioning as discrete-joint, ball-bearing-supported Bowden cables. Our rolling-plus-twist joint design decouples pairs of routed cables from changing significantly in tension, length, or friction during large-angle motion of the linked transmission. Using sub-1 mm diameter high-strength synthetic cable, the transmission exhibits a maximum coupling motion of only 0.15 mm over the full range of motion of the cable-transmission mechanism, approximately 10% of pretension in combined hysteresis and friction, a transmission stiffness of 10 N/mm, and a weight of just 9 g per rolling joint and 5 g per twist joint. Two applications are demonstrated: cable routing alongside a robot arm, for example for remote gripper actuation, and remote needle advancement for an MRI-safe needle biopsy robot.
|
|
08:45-08:50, Paper WeAT3.4 | |
Advanced Xθ Reluctance Electromagnetic Micropositioning System for Precision Motion Control |
|
Pumphrey, Michael Joseph | University of Guelph |
Alatawneh, Natheer | University of Guelph |
Al Janaideh, Mohammad | University of Guelph |
Keywords: Actuation and Joint Mechanisms
Abstract: This study examines a novel micropositioning trajectory manipulator in Xθ, energized by a reluctance actuator (RA) and two accompanying moving magnet actuators (MMA). The design is characterized by a C-core RA, which features asymmetrical air gaps between the mover and the stator elements under angular θ rotation. When the stator coil is energized, a magnetic flux induces a force in the mover. Two MMAs can add force and torque dynamics to the system via solenoid and permanent magnet (PM) pairs to offer additional corrective actions, facilitating control of a translational x and rotational θ two-degree-of-freedom (2DOF) actuation system. Flexure hinges aid the retraction force of the mover element and provide the needed stiffness to the system without frictional effects. The system was modeled analytically and optimized to achieve the outlined performance objectives, and it was validated experimentally through triangle and sinusoidal trajectories in open-loop control. The most relevant application is scanning mirror systems, where specific targeted rotational and translational trajectories can benefit light-beam positioning. This system allows both the translation and rotation specifications of a selected trajectory to be realized in one actuation unit, opening up more design possibilities for controlling precision positioning systems.
|
|
08:50-08:55, Paper WeAT3.5 | |
Cycloidal Quasi-Direct Drive Actuator Designs with Learning-Based Torque Estimation for Legged Robotics |
|
Zhu, Alvin | University of California Los Angeles |
Tanaka, Yusuke | University of California, Los Angeles |
Rafeedi, Fadi | University of California, Los Angeles |
Hong, Dennis | UCLA |
Keywords: Machine Learning for Robot Control, Actuation and Joint Mechanisms, Legged Robots
Abstract: This paper presents a novel approach through the design and implementation of Cycloidal Quasi-Direct Drive actuators for legged robotics. The cycloidal gear mechanism, with its inherent high torque density and mechanical robustness, offers significant advantages over conventional designs. By integrating cycloidal gears into the Quasi-Direct Drive framework, we aim to enhance the performance of legged robots, particularly in tasks demanding high torque and dynamic loads, while still keeping them lightweight. Additionally, we develop a torque estimation framework for the actuator using an Actuator Network, which effectively reduces the sim-to-real gap introduced by the cycloidal drive’s complex dynamics. This integration is crucial for capturing the complex dynamics of a cycloidal drive, which contributes to improved learning efficiency, agility, and adaptability for reinforcement learning.
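Actuator networks of the kind mentioned here are typically small MLPs that map a short history of joint position errors and velocities to the realized joint torque. The PyTorch sketch below shows that general structure with made-up input dimensions and synthetic data; it is not the authors' network architecture or training setup.

```python
import torch
import torch.nn as nn

# Input: (position error, velocity) over the last 3 control steps -> 6 features.
actuator_net = nn.Sequential(
    nn.Linear(6, 32), nn.Softsign(),
    nn.Linear(32, 32), nn.Softsign(),
    nn.Linear(32, 1),                  # predicted output torque [Nm]
)

optimizer = torch.optim.Adam(actuator_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic stand-in data; in practice these come from logged hardware torques.
features = torch.randn(1024, 6)
torques = torch.randn(1024, 1)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(actuator_net(features), torques)
    loss.backward()
    optimizer.step()
print(float(loss))                     # training loss on the synthetic data
```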
|
|
08:55-09:00, Paper WeAT3.6 | |
Compact Modular Robotic Wrist with Variable Stiffness Capability |
|
Sun, Hyunsoo | Korea Institute of Science and Technology |
Park, Sungwoo | Korea University, KIST |
Hwang, Donghyun | Korea Institute of Science and Technology |
Keywords: Mechanism Design, Compliant Joint/Mechanism, Robotic Wrist, Grasping
Abstract: We have developed a two-degree-of-freedom robotic wrist with variable stiffness capability, designed for situations where collisions between the end-effector and the environment are inevitable. To enhance environmental adaptability and prevent physical damage, the wrist can operate in a low-stiffness mode. However, the flexibility of this mode might negatively impact stable and precise manipulation. To address this, we proposed a robotic wrist that switches between a passive low-stiffness mode for environmental adaptation and an active high-stiffness mode for precise manipulation. Initially, we developed a functional prototype that could manually switch between these modes, demonstrating the wrist's passive low-stiffness and active high-stiffness states. This prototype was designed as a lightweight, flat-type modular device, incorporating a sheet-type flexure as the motion guide and embedding all essential components, including actuators, sensors, and a control unit, into the wrist module. Based on the functional prototype, we developed an improved version to enhance durability and functionality. The resulting wrist module incorporates a three-axis F/T sensor and an impedance control system to control the stiffness. It measures 55 mm in height, weighs 200 g, and offers a 232.4-fold active stiffness variation.
|
|
WeAT4 |
304 |
Vision Applications |
Regular Session |
Co-Chair: Wang, Zhenzhou | Huaibei Normal University |
|
08:30-08:35, Paper WeAT4.1 | |
A Natural-Neighbor-Interpolant-Based Pattern Modeling Method for Robust Decoding of the Structured Light Pattern (I) |
|
Wang, Zhenzhou | Huaibei Normal University |
Liu, Shuo | Fujian Normal University |
Keywords: Computer Vision for Automation, Computer Vision for Manufacturing, Recognition
Abstract: Active stereo vision (ASV) computes the parallax and depth information from coded structured light patterns; thus, it can overcome the difficulty of measuring objects without textures and colors. However, decoding of the structured light patterns at locations with color crosstalk, specular reflection, and occlusion remains challenging. In this paper, we propose a natural-neighbor-interpolant-based pattern modeling method to decode the structured light point pattern robustly. The robustness is achieved in the sense of one-hundred-percent point segmentation completeness. Owing to this completeness, the points in the corresponding blocks are matched directly according to their indices. Experimental results verify the effectiveness of the proposed method.
|
|
08:35-08:40, Paper WeAT4.2 | |
Automated Video Object Detection of Motile Cells under Microscopy |
|
Song, Haocong | University of Toronto |
Chen, Wenyuan | University of Toronto |
Shan, Guanqiao | Dalian University of Technology |
Sun, Chen | University of Toronto |
Wan, Bingqing | University of Toronto |
Dai, Changsheng | Dalian University of Technology |
Liu, Hang | University of Toronto |
Wang, Shanshan | Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School |
Sun, Yu | University of Toronto |
Keywords: Computer Vision for Automation
Abstract: Video object detection (VOD) of motile cells (e.g., bacteria and sperm) under microscopy is challenging due to motion blur, sporadic out-of-focus, and pose variations. Compared with VOD in generic scenes, the lower contrast and smaller color space of microscopy imaging further introduce feature overlap between the foreground objects and the background objects (e.g., impurity cells and contaminants). Transformer-based methods have achieved great success in the VOD of generic scenes by utilizing object queries to model the inner-frame objects and the inter-frame objects. However, the appearance overlap problem in microscopy video frames significantly compromises the inter-frame query aggregation by introducing background features into the object query. To tackle this challenge, this paper reports a static-dynamic query-based VOD network that treats object queries of the current video frame and reference video frames differently. Specifically, a two-stage framework is implemented that first generates high-quality object queries of reference frames with a static Transformer decoder pre-trained on a still image dataset. The network is then trained on a per-frame annotated dataset using a dynamic Transformer decoder to model the object queries of the current frame. A Reference Query Relation Module is further proposed to enhance the reference queries for more effective aggregation with the current query. Experiments on clinically collected biopsied sperm datasets validated the effectiveness of the proposed method.
|
|
08:40-08:45, Paper WeAT4.3 | |
Vision-Based Movement Primitives for Lunar Hazard Avoidance |
|
Cloud, Joseph | NASA Kennedy Space Center |
Beksi, William J. | The University of Texas at Arlington |
Schuler, Jason | NASA Kennedy Space Center |
Keywords: Space Robotics and Automation, Mining Robotics, Learning from Demonstration
Abstract: To support sustainable infrastructure on the Moon, NASA is developing the In-Situ Resource Utilization (ISRU) Pilot Excavator (IPEx) to extract and transport lunar regolith for processing and construction. During its mission, IPEx will execute various driving patterns, primarily cycling between excavation and unloading sites, with additional maneuvers such as circular traverses around the lander and raster scans for environmental mapping. In this work, dynamic movement primitives (DMPs) are used to represent these patterns. We augment the DMPs with a vision-based real-time obstacle avoidance system to navigate surface hazards, such as rocks, encountered during traversal. Our approach is evaluated in a high-fidelity simulation replicating the challenging environment of the lunar south pole to demonstrate IPEx’s ability to adapt to surface hazards while fulfilling its operational tasks.
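For readers less familiar with DMPs, the sketch below integrates a minimal 2-D discrete DMP whose transformation system is perturbed by a simple, bounded repulsive coupling term when an obstacle is near. The gains, the omitted forcing term, and the obstacle model are illustrative choices, not the paper's vision-based formulation.

```python
import numpy as np

def run_dmp(start, goal, obstacle, T=3.0, dt=0.01,
            alpha=25.0, beta=25.0 / 4.0, gamma=50.0, rho=0.1):
    """Minimal 2-D DMP rollout with a repulsive obstacle coupling term."""
    y, dy = np.array(start, float), np.zeros(2)
    g, obs = np.array(goal, float), np.array(obstacle, float)
    traj = [y.copy()]
    for _ in range(int(T / dt)):
        diff = y - obs
        d = max(np.linalg.norm(diff), 1e-6)
        avoid = gamma * np.exp(-d / rho) * diff / d   # repulsion, decays with distance
        ddy = alpha * (beta * (g - y) - dy) + avoid   # transformation system (no forcing term)
        dy += ddy * dt
        y = y + dy * dt
        traj.append(y.copy())
    return np.array(traj)

path = run_dmp(start=(0.0, 0.0), goal=(1.0, 0.0), obstacle=(0.5, 0.02))
print("end point:", path[-1])                                       # close to the goal
print("min clearance:", np.linalg.norm(path - [0.5, 0.02], axis=1).min())
```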
|
|
08:45-08:50, Paper WeAT4.4 | |
LAFNET: Lightweight Aerial Fire Detection Model for Onboard Edge Computing |
|
Zhai, Haozhou | Sun Yat-Sen University |
Yan, Weiming | Sun Yat-Sen University |
Wang, Xiaohan | Sun Yat-Sen University |
Zhao, Tuhao | Sun Yat-Sen University |
Hu, Tianjiang | Sun Yat-Sen University |
Keywords: Deep Learning for Visual Perception, Aerial Systems: Perception and Autonomy, Recognition
Abstract: Fire poses significant threats to life and property, necessitating efficient inspection and accurate identification. Although aerial computer vision algorithms hold great promise, the computational limitations of onboard platforms prevent existing algorithms from meeting high standards of accuracy and real-time performance. To address this challenge, we propose a lightweight aerial fire detection model, LAFNET. This model incorporates the EffiDarknetLight backbone, optimized for lightweight design, and integrates specially designed LG block components within the LG PAN neck, resulting in only 1.3 M parameters. Experimental results demonstrate that our method attains a good trade-off between lightweight design and detection accuracy. Compared to YOLOv5n, the smallest standard YOLO-series model, LAFNET improves mAP by 2.1% while reducing parameters and FLOPs by 27.8% and 29.3%, respectively, and improves inference speed on the Nvidia Orin Nano edge computing platform by 24.8%. These experiments indicate that LAFNET offers a highly efficient solution for aerial fire detection, combining speed and accuracy.
|
|
08:50-08:55, Paper WeAT4.5 | |
UDSV: Unsupervised Deep Stitching for Tractor-Trailer Surround View |
|
Sun, Leyao | Beijing Institute of Technology |
Liang, Hao | Beijing Institute of Technology |
Dong, Zhipeng | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Omnidirectional Vision, Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: In recent years, with the rapid development of Advanced Driver Assistance Systems (ADAS), the demand for precise and efficient surround view stitching systems has significantly increased. Traditional stitching methods perform well on small single-unit vehicles with stable camera poses. However, the stitching quality degrades sharply when applied to large tractor-trailers due to the continuous pose changes caused by the non-rigid connection between the tractor and trailer. First, the extended length of tractor-trailers results in low overlap between cameras, making feature extraction and matching challenging. Additionally, the stitched images often appear irregular, detracting from visual quality. Besides, even if static stitching looks natural, it causes jitter in dynamic scenarios due to random feature extraction. In this paper, we propose an unsupervised deep stitching method for tractor-trailer surround view systems. We introduce a feature extraction module for tractor-trailer scenarios (FMT) to enhance feature extraction in low-overlap situations. Besides, we design a spatiotemporally consistent control point constraint strategy (STCC) to achieve spatial shape preservation and temporal smoothing effects, resulting in visually consistent and stable stitched sequences. Experimental results on both public and real-world datasets show that our method efficiently completes tractor-trailer surround view stitching, producing well-aligned and natural panoramic images compared to previous methods.
|
|
08:55-09:00, Paper WeAT4.6 | |
Think Step by Step: Chain-Of-Gesture Prompting for Error Detection in Robotic Surgical Videos |
|
Shao, Zhimin | Tsinghua University |
Xu, Jialang | University College London |
Stoyanov, Danail | University College London |
Mazomenos, Evangelos | UCL |
Jin, Yueming | National University of Singapore |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy, Visual Learning
Abstract: Despite advancements in robotic systems and surgical data science, ensuring safe execution in robot-assisted minimally invasive surgery (RMIS) remains challenging. Current methods for surgical error detection typically involve two parts: identifying gestures and then detecting errors within each gesture clip. These methods often overlook the rich contextual and semantic information inherent in surgical videos, with limited performance due to reliance on accurate gesture identification. Inspired by the chain-of-thought prompting in natural language processing, this letter presents a novel and real-time end-to-end error detection framework, Chain-of-Gesture (COG) prompting, integrating contextual information from surgical videos step by step. This encompasses two reasoning modules that simulate expert surgeons' decision-making: a Gestural-Visual Reasoning module using transformer and attention architectures for gesture prompting and a Multi-Scale Temporal Reasoning module employing a multi-stage temporal convolutional network with slow and fast paths for temporal information extraction. We validate our method on the JIGSAWS dataset and show improvements over the state-of-the-art, achieving 4.6% higher F1 score, 4.6% higher Accuracy, and 5.9% higher Jaccard index, with an average frame processing time of 6.69 milliseconds. This demonstrates our approach's potential to enhance RMIS safety and surgical education efficacy. The code is available at https://github.com/jinlab-imvr/Chain-of-Gesture.
|
|
WeAT5 |
305 |
Aerial Manipulation 1 |
Regular Session |
Chair: Loianno, Giuseppe | New York University |
|
08:30-08:35, Paper WeAT5.1 | |
The Palletrone Cart: Human-Robot Interaction-Based Aerial Cargo Transportation |
|
Park, Geonwoo | Seoul National University of Science and Technology |
Park, Hyungeun | Seoul National University of Science and Technology |
Park, Wooyong | Seoul National University of Science and Technology |
Lee, Dongjae | Seoul National University |
Kim, Murim | Korea Institute of Robot and Convergence |
Lee, Seung Jae | Seoul National University of Science and Technology |
Keywords: Aerial Systems: Mechanics and Control, Physical Human-Robot Interaction, Aerial Systems: Applications
Abstract: This paper presents a new cargo transportation solution based on physical human-robot interaction utilizing a novel fully-actuated multirotor platform called Palletrone. The platform is designed with a spacious upper flat surface for easy cargo loading, complemented by a rear-mounted handle reminiscent of a shopping cart. Flight trajectory control is achieved by a human operator gripping the handle and applying three-dimensional forces and torques while maintaining a stable cargo transport with zero roll and pitch attitude throughout the flight. To facilitate physical human-robot interaction, we employ an admittance control technique. Instead of relying on complex force estimation methods, like in most admittance control implementations, we introduce a simple yet effective estimation technique based on a disturbance observer robust control algorithm. We conducted an analysis of the flight stability and performance in response to changes in system mass resulting from arbitrary cargo loading. Ultimately, we demonstrate that individuals can effectively control the system trajectory by applying appropriate interactive forces and torques. Furthermore, we showcase the performance of the system through various experimental scenarios.
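As background on the admittance-control idea used here, the sketch below integrates the standard virtual mass-damper admittance law, turning an estimated interaction force into a velocity reference for the flight controller. The single-axis gains and synthetic force profile are illustrative values, not the Palletrone's parameters.

```python
import numpy as np

def admittance_reference(forces, dt=0.01, M=2.0, D=8.0):
    """Map an estimated interaction force to a velocity reference (one axis).

    Integrates the virtual dynamics  M * dv/dt + D * v = f_ext,
    so pushing on the handle produces a proportional, smoothly decaying motion.
    """
    v, refs = 0.0, []
    for f in forces:
        dv = (f - D * v) / M
        v += dv * dt
        refs.append(v)
    return np.array(refs)

# A 1 s push of 4 N followed by release.
t = np.arange(0.0, 3.0, 0.01)
f_ext = np.where(t < 1.0, 4.0, 0.0)
v_ref = admittance_reference(f_ext)
print(v_ref.max(), v_ref[-1])          # approaches f/D = 0.5 m/s, then decays toward 0
```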
|
|
08:35-08:40, Paper WeAT5.2 | |
Design of a Suspended Manipulator with Aerial Elliptic Winding |
|
Niddam, Ethan | University of Strasbourg, ICube |
Dumon, Jonathan | GIPSA-LAB |
Cuvillon, Loic | University of Strasbourg |
Durand, Sylvain | INSA Strasbourg & ICube |
Querry, Stephane | Polyvionics |
Hably, Ahmad | Grenoble-Inp |
Gangloff, Jacques | University of Strasbourg |
Keywords: Aerial Systems: Mechanics and Control, Art and Entertainment Robotics, Tendon/Wire Mechanism
Abstract: Art is one of the oldest forms of human expression, constantly evolving, taking new forms and using new techniques. With their increased accuracy and versatility, robots can be considered as a new class of tools to perform works of art. The STRAD (STReet Art Drone) project aims to perform a 10-meter-high painting on a vertical surface with sub-centimetric precision. To achieve this goal we introduce a new design for an aerial manipulator with elastic suspension capable of moving from one equilibrium position to another using only its thrusters and an elliptic pulley-counterweight system. A feedback linearization control law is implemented to perform fast and accurate winding and unwinding of an elastic cable.
|
|
08:40-08:45, Paper WeAT5.3 | |
Autonomous Heavy Object Pushing Using a Coaxial Tiltrotor (I) |
|
Hwang, Sunwoo | Seoul National University |
Lee, Dongjae | Seoul National University |
Kim, Changhyeon | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Mobile Manipulation
Abstract: Aerial physical interaction (APhI) with a multirotor-based platform, such as pushing a heavy object, demands generation of a sufficiently large interaction force while maintaining stability. Such a requirement can cause rotor saturation, because the rotor thrust enlarged for the interaction force may leave a reduced margin for attitude stabilization. We first design an H-shaped coaxial tiltrotor that can generate a larger interaction force than a conventional multirotor. We then propose an overall framework composed of a high-level robust controller and low-level control allocation for the coaxial tiltrotor to ensure robustness against the uncertain motion of the unknown interacting object and to overcome the saturation issue. To guarantee robustness at all times, we design a controller based on a nonlinear disturbance observer (DOB). Then, we formulate the problem of computing low-level actuator inputs that avoid rotor saturation as a tractable nonlinear optimization problem, which can be solved in real time. The proposed framework is validated in extensive real-world experiments where the 3.3 kg tiltrotor successfully pushes a cart weighing up to 60 kg. An ablation study with the tiltrotor shows the effectiveness of the proposed control allocation law in avoiding rotor saturation. Furthermore, a comparative experiment with a conventional multirotor shows failure in the same setting, which validates the use of the coaxial tiltrotor. An experimental video can be found at htt
|
|
08:45-08:50, Paper WeAT5.4 | |
Aerial Grasping by Multi-Limbed Flying Robot SPIDAR Based on Vectored Thrust Control |
|
Zhao, Moju | The University of Tokyo |
Keywords: Aerial Systems: Applications, Grasping, Motion Control
Abstract: Delivery by aerial robots is an emerging topic in many scenarios, such as logistics, the construction industry, and disaster response. Compared to standard styles that deploy a cage or sling, a grasping style using a gripper can handle objects of various shapes. A multi-limbed structure with distributed vectorable rotors called SPIDAR shows a higher potential to grasp large objects in a three-dimensional manner. Therefore, in this paper, we focus on the advanced usage of the vectored thrust forces to achieve aerial grasping by this robot. First, a vectored thrust control to avoid aerodynamic interference on the underwind segments (e.g., the grasped object) during flight is proposed. Then, an optimization-based planning method that utilizes redundant vectored thrust forces for firm grasping is developed. Finally, we demonstrate the feasibility of the proposed flight control and grasp planning by performing a challenging grasping and transporting motion with a spherical object 0.6 m in diameter. To the best of our knowledge, this work is the first to achieve multi-finger-like grasping to carry a large object in midair.
|
|
08:50-08:55, Paper WeAT5.5 | |
Hook-Based Aerial Payload Grasping from a Moving Platform |
|
Antal, Peter | Institute for Computer Science and Control (SZTAKI) |
Péni, Tamás | SZTAKI Institute for Computer Science and Control |
Toth, Roland | Eindhoven University of Technology (TU/e) |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Planning under Uncertainty
Abstract: This paper investigates payload grasping from a moving platform using a hook-equipped aerial manipulator. First, a computationally efficient trajectory optimization based on complementarity constraints is proposed to determine the optimal grasping time. To enable application in complex, dynamically changing environments, the future motion of the payload is predicted using a physics simulator-based model. The success of payload grasping under model uncertainties and external disturbances is formally verified through a robustness analysis method based on integral quadratic constraints. The proposed algorithms are evaluated in a high-fidelity physical simulator, and in real flight experiments using a custom-designed aerial manipulator platform.
|
|
08:55-09:00, Paper WeAT5.6 | |
Human-Aware Physical Human-Robot Collaborative Transportation and Manipulation with Multiple Aerial Robots |
|
Li, Guanrui | Worcester Polytechnic Institute |
Xinyang, Liu | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Cooperating Robots, Physical Human-Robot Interaction
Abstract: Human-robot interaction will play an essential role in various industries and daily tasks, enabling robots to effectively collaborate with humans and reduce their physical workload. This paper proposes a novel approach for physical human-robot collaborative transportation and manipulation of a cable-suspended payload with multiple aerial robots. The proposed method enables smooth and intuitive interaction between the transported objects and a human worker. At the same time, we consider distance constraints during the operations by exploiting the internal redundancy of the multi-robot transportation system. We validate the approach through extensive simulation and real-world experiments. These include scenarios where the robot team assists the human in transporting and manipulating a load, or where the human helps the robot team navigate the environment. We experimentally demonstrate for the first time, to the best of our knowledge, that our approach enables a quadrotor team to physically collaborate with a human in manipulating a payload in all 6 DoF in collaborative human-robot transportation and manipulation tasks.
|
|
WeAT6 |
307 |
Vision-Based Navigation 1 |
Regular Session |
Chair: Zhang, Fumin | Hong Kong University of Science and Technology |
|
08:30-08:35, Paper WeAT6.1 | |
VLN-KHVR: Knowledge-And-History Aware Visual Representation for Continuous Vision-And-Language Navigation |
|
Kong, Ping | Tianjin University |
Liu, Ruonan | Shanghai Jiao Tong University |
Xie, Zongxia | Tianjin University |
Pang, Zhibo | KTH Royal Institute of Technology |
Keywords: Vision-Based Navigation
Abstract: Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to navigate with low-level actions following natural language instructions in 3D environments. Most existing approaches utilize observation features from the current step to represent the viewpoint. However, these representations often conflate redundant and essential information for navigation, introducing ambiguity into the agent's action prediction. To address the problem of inadequate representation, we propose a Knowledge-and-History Aware Visual Representation for Continuous Vision-and-Language Navigation (VLN-KHVR). The proposed approach constructs enriched visual representations tailored to navigation instructions, enhancing agents’ navigation performance. Specifically, VLN-KHVR extracts image features from the current observation, retrieves relevant knowledge in the knowledge base, and obtains the history of the navigation episode. Subsequently, the knowledge and history features are filtered to eliminate the information irrelevant to navigation instruction. These refined features are integrated with the instruction for further interaction. Finally, the aggregated features are used to guide navigation. Our model outperforms previous methods on the VLN-CE benchmark, demonstrating the effectiveness of the proposed method.
|
|
08:35-08:40, Paper WeAT6.2 | |
LiteVLoc: Map-Lite Visual Localization for Image Goal Navigation |
|
Jiao, Jianhao | University College London |
He, Jinhao | The Hong Kong University of Science and Technology (Guangzhou) |
Liu, Changkun | The Hong Kong University of Science and Technology |
Aegidius, Sebastian | University College London |
Hu, Xiangcheng | Hong Kong University of Science and Technology |
Braud, Tristan | HKUST |
Kanoulas, Dimitrios | University College London |
Keywords: Localization, Vision-Based Navigation, SLAM
Abstract: This paper presents LiteVLoc, a hierarchical visual localization framework that uses a lightweight topometric map to represent the environment. The method consists of three sequential modules that estimate camera poses in a coarse-to-fine manner. Unlike dense 3D mapping methods, LiteVLoc reduces storage by avoiding geometric reconstruction. It uses a learning-based feature matcher to establish dense correspondences between sparse keyframes and observations, and then refines poses with a geometric solver, enabling robustness to viewpoint changes. The system assumes a depth sensor or stereo camera for deployment. A novel dataset for the map-free relocalization task is also introduced. Extensive experiments, including localization and navigation in both simulated and real-world scenarios, have validated the system's performance and demonstrated its precision and efficiency for large-scale deployment. Code and data will be made publicly available at the webpage: https://rpl-cs-ucl.github.io/LiteVLoc.
|
|
08:40-08:45, Paper WeAT6.3 | |
BEINGS: Bayesian Embodied Image-Goal Navigation with Gaussian Splatting |
|
Meng, Wugang | Hong Kong University of Science and Technology |
Wu, Tianfu | Hong Kong University of Science and Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Vision-Based Navigation, Search and Rescue Robots, Probabilistic Inference
Abstract: Image-goal navigation enables a robot to reach the location where a target image was captured, using visual cues for guidance. However, current methods either rely heavily on data and computationally expensive learning-based approaches or lack efficiency in complex environments due to insufficient exploration strategies. To address these limitations, we propose Bayesian Embodied Image-goal Navigation Using Gaussian Splatting, a novel method that formulates ImageNav as an optimal control problem within a model predictive control framework. BEINGS leverages 3D Gaussian Splatting as a scene prior to predict future observations, enabling efficient, real-time navigation decisions grounded in the robot’s sensory experiences. By integrating Bayesian updates, our method dynamically refines the robot's strategy without requiring extensive prior experience or data. Our algorithm is validated through extensive simulations and physical experiments, showcasing its potential for embodied robot systems in visually complex scenarios. Project Page: www.mwg.ink/BEINGS-web.
|
|
08:45-08:50, Paper WeAT6.4 | |
FLAF: Focal Line and Feature-Constrained Active View Planning for Visual Teach and Repeat |
|
Fu, Changfei | SUSTech |
Chen, Weinan | Guangdong University of Technology |
Xu, Wenjun | Peng Cheng Laboratory |
Zhang, Hong | SUSTech |
Keywords: View Planning for SLAM, Vision-Based Navigation, SLAM
Abstract: This paper presents FLAF, a focal line and feature-constrained active view planning method for tracking failure avoidance in feature-based visual navigation of mobile robots. FLAF is built on a feature-based visual teach and repeat (VT&R) framework, which supports robotic applications by teaching robots to cruise various paths that fulfill many daily autonomous navigation requirements. However, tracking failures in feature-based Visual Simultaneous Localization and Mapping (VSLAM), particularly in textureless regions common in human-made environments, pose a significant challenge to the real-world deployment of VT&R. To address this problem, the proposed view planner is integrated into a feature-based VSLAM system, creating an active VT&R solution that mitigates tracking failures. Our system features a Pan-Tilt Unit (PTU)-based active camera mounted on a mobile robot. Using FLAF, the active camera-based VSLAM (AC-SLAM) operates during the teaching phase to construct a complete path map and in the repeating phase to maintain stable localization. FLAF actively directs the camera toward more map points to avoid mapping failures during path learning and toward more feature-identifiable map points while following the learned trajectory. Experimental results in real scenarios show that FLAF significantly outperforms existing methods by accounting for feature identifiability, particularly the view angle of the features. While effectively dealing with low-texture regions in active view planning, considering feature identifiability enables our active VT&R system to perform well in challenging environments.
|
|
08:50-08:55, Paper WeAT6.5 | |
Ground-Level Viewpoint Vision-And-Language Navigation in Continuous Environments |
|
Li, Zerui | Adelaide University |
Zhou, Gengze | University of Adelaide |
Hong, Haodong | The University of Queensland |
Shao, Yanyan | Zhejiang University of Technology |
Lyu, Wenqi | The University of Adelaide |
Qiao, Yanyuan | The University of Adelaide |
Wu, Qi | University of Adelaide |
Keywords: Deep Learning Methods, Vision-Based Navigation
Abstract: Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment is still challenging. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Ground-level Viewpoint Navigation (GVNav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
|
|
08:55-09:00, Paper WeAT6.6 | |
NavTr: Object-Goal Navigation with Learnable Transformer Queries |
|
Mao, Qiuyu | University of Science and Technology of China |
Jikai, Wang | University of Science and Technology of China, Department of Automation |
Xu, Meng | University of Science and Technology of China |
Chen, Zonghai | University of Sciences and Technology of China |
Keywords: Vision-Based Navigation, Representation Learning, Reinforcement Learning
Abstract: This paper introduces Navigation Transformer (NavTr), a novel framework for object-goal navigation using Transformer queries to enhance the learning and representation of environment states. By integrating semantic information, object positions, and neighborhood information, NavTr creates a unified, comprehensive, and extensible state representation for the object-goal navigating task. In the framework, the Transformer queries implicitly learn inter-object relationships, which facilitates high-level understanding of the environment. Additionally, NavTr implements target-oriented supervisory signals, such as rotation rewards and spatial loss, which improve exploration efficiency in the reinforcement learning framework. NavTr outperforms popular graph-based and Attention-based methods by a large margin in terms of success rate (SR) and success weighted by path length (SPL). Extensive experiments on the AI2-THOR dataset demonstrate the effectiveness of our approach.
|
|
WeAT7 |
309 |
Marine Robotics 3 |
Regular Session |
Chair: Rekleitis, Ioannis | University of Delaware |
Co-Chair: Drupt, Juliette | University of Montpellier |
|
08:30-08:35, Paper WeAT7.1 | |
Shape BoW: Generalized Bag of Words for Appearance-Based Loop Closure Detection in Bathymetric SLAM |
|
Zhang, Qianyi | Korea Advanced Institute of Science and Technology |
Kim, Jinwhan | KAIST |
Keywords: Marine Robotics, Autonomous Vehicle Navigation, SLAM
Abstract: Existing bathymetric simultaneous localization and mapping (SLAM) methods predominantly rely on odometry information for loop closure detection, whose performance deteriorates when handling unreliable odometry data or conducting large-scale mapping missions. This letter introduces a novel generalized Bag of Words (BoW) named Shape BoW (S-BoW) for appearance-based loop closure detection in bathymetric SLAM. S-BoW is trained on a collection of terrain gradient features extracted from existing bathymetric datasets and can be used in various bathymetric scenarios. We integrated the loop closure detection method using S-BoW into a feature-based bathymetric SLAM method called TTT SLAM, and we evaluated its performance against three existing bathymetric SLAM methods using two datasets. The results indicate that S-BoW not only serves as a generalized BoW but also enhances the efficiency of the integrated SLAM method, achieving accuracy comparable to the original TTT SLAM while offering a 37% speed improvement on a large-scale sea trial dataset. To the best of our knowledge, S-BoW is the first generalized BoW that can be used to realize effective appearance-based loop closure detection in bathymetric SLAM.
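For readers unfamiliar with the Bag-of-Words machinery, the snippet below shows the generic pipeline: quantize per-scan descriptors against a fixed vocabulary, build TF-IDF histograms, and score loop-closure candidates by cosine similarity. The vocabulary and descriptors are random stand-ins; the terrain-gradient features and vocabulary training of S-BoW are not reproduced here.

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Quantize descriptors to their nearest vocabulary word and count occurrences."""
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary)).astype(float)

def tfidf_similarity(hists):
    """Pairwise cosine similarity of TF-IDF weighted BoW histograms."""
    H = np.stack(hists)
    df = (H > 0).sum(axis=0)                          # scans containing each word
    idf = np.log((1 + len(hists)) / (1 + df)) + 1.0   # smoothed IDF (never zero)
    X = H * idf
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    return X @ X.T

# Random stand-ins: a 64-word vocabulary and descriptors for three "scans",
# where scan 2 revisits scan 0's features (a loop-closure candidate).
rng = np.random.default_rng(2)
vocab = rng.normal(size=(64, 8))
scan0 = rng.normal(size=(200, 8))
scans = [scan0, rng.normal(size=(200, 8)), scan0 + 0.05 * rng.normal(size=(200, 8))]
sims = tfidf_similarity([bow_histogram(s, vocab) for s in scans])
print(np.round(sims, 2))   # entry (0, 2) should be the largest off-diagonal score
```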
|
|
08:35-08:40, Paper WeAT7.2 | |
ODYSSEE: Oyster Detection Yielded by Sensor Systems on Edge Electronics |
|
Lin, Xiaomin | University of Maryland |
Mange, Vivek Dharmesh | University of Delaware |
Suresh, Arjun | University of Maryland, College Park |
Palnitkar, Aadi | University of Maryland College Park |
Neuberger, Bernhard | TU Wien |
Campbell, Brendan | University of Delaware School of Marine Science and Policy |
Williams, Alan | University of Maryland Center for Environmental Science |
Baxevani, Kleio | University of Delaware |
Mallette, Jeremy | Independent Robotics |
Vera Gonzalez, Alhim Adonai | University of Cincinnati |
Vincze, Markus | Vienna University of Technology |
Rekleitis, Ioannis | University of Delaware |
Tanner, Herbert G. | University of Delaware |
Aloimonos, Yiannis | University of Maryland |
Keywords: Marine Robotics, Recognition, Data Sets for Robot Learning
Abstract: Oysters are an important keystone species in coastal ecosystems that provide several economic, environmental, and cultural benefits. Given the array of utilities derived from oysters, the application of autonomous robotic systems for oyster detection and monitoring grows increasingly relevant. However, current monitoring strategies for assessing oyster assemblages are mostly destructive. While manually identifying and monitoring oysters from video footage is nondestructive, it is tedious and requires expert input. An alternative to human monitoring is deploying trained object detection models on edge devices, such as the Aqua2 robot, to enable real-time monitoring of oysters directly in the field. Yet training these models to maximum efficacy requires an extensive dataset that accurately represents the domain, and it is difficult to obtain such high-quality training data due to the complications inherent to underwater environments. To address these complications, we introduce a novel method leveraging Stable Diffusion to generate high-quality synthetic data for the marine domain. We exploit diffusion models to create photorealistic oyster imagery, using ControlNet inputs to ensure consistency with the segmentation ground-truth mask, the geometry of the scene, and the target domain of real oyster images. This large dataset is used to train a vision model, specifically based on YOLOv10. The trained model is then deployed and tested on an edge platform, the Aqua2, in an underwater robotics system. We achieve state-of-the-art performance (0.657 mAP@50) for oyster detection, which can pave the way for autonomous oyster habitat monitoring and increase the efficiency of on-bottom oyster aquaculture.
|
|
08:40-08:45, Paper WeAT7.3 | |
IBURD: Image Blending for Underwater Robotic Detection |
|
Hong, Jungseok | MIT |
Singh, Sakshi | University of Minnesota |
Sattar, Junaed | University of Minnesota |
Keywords: Marine Robotics, Data Sets for Robotic Vision, Visual Learning
Abstract: We present an image blending pipeline, IBURD, that creates realistic synthetic images to assist in the training of deep detectors for use on underwater autonomous vehicles (AUVs) for marine debris detection tasks. Specifically, IBURD generates both images of underwater debris and their pixel-level annotations, using source images of debris objects, their annotations, and target background images of marine environments. With Poisson editing and style transfer techniques, IBURD is even able to robustly blend transparent objects into arbitrary backgrounds and automatically adjust the style of blended images using the blurriness metric of target background images. These generated images of marine debris in actual underwater backgrounds address the data scarcity and data variety problems faced by deep-learned vision algorithms in challenging underwater conditions, and can enable the use of AUVs for environmental cleanup missions. Both quantitative and robotic evaluations of IBURD demonstrate the efficacy of the proposed approach for robotic detection of marine debris.
|
|
08:45-08:50, Paper WeAT7.4 | |
3DSSDF: Underwater 3D Sonar Reconstruction Using Signed Distance Functions |
|
Archieri, Simon | Heriot-Watt University |
Drupt, Juliette | University of Montpellier |
Cinar, Ahmet Fatih | Frontier Robotics |
Grimaldi, Michele | University of Girona |
Carlucho, Ignacio | University of Edinburgh |
Scharff Willners, Jonatan | Heriot-Watt University |
Petillot, Yvan R. | Heriot-Watt University |
Keywords: Marine Robotics, Mapping
Abstract: Underwater autonomous robotic operations require online localization and 3D mapping. Because of the absence of absolute positioning underwater, these tasks rely strongly on embedded sensors, including proprioceptive or navigation sensors, which can be fused for odometry, and exteroceptive sensors. One of the most popular exteroceptive sensors underwater is the imaging sonar, which emits a large fan-shaped acoustic signal and estimates the position of the surrounding obstacles from a measure of the reflected signal. This paper addresses underwater online localization and 3D mapping using a forward-looking, wide-aperture imaging sonar and the vehicle's intrinsic navigation estimates. We introduce 3DSSDF (3D Sonar Reconstruction Using Signed Distance Functions), a new localization and 3D mapping algorithm based on signed distance functions, which is evaluated in simulation and on real data, in man-made and natural environments. Comparisons to reference trajectories and maps demonstrate that, in our tests, 3DSSDF efficiently corrects navigation drift and that the trajectory and map error is always below 1 m and below 1% of the distance travelled, which can be sufficient for the safe inspection of natural or artificial underwater structures.
|
|
08:50-08:55, Paper WeAT7.5 | |
Cascade IPG Observer for Underwater Robot State Estimation |
|
Joshi, Kaustubh | University of Maryland College Park |
Liu, Tianchen | University of Maryland, College Park |
Chopra, Nikhil | University of Maryland, College Park |
Keywords: Marine Robotics, Localization, Sensor Fusion
Abstract: This paper presents a novel cascade nonlinear observer framework for inertial state estimation. It tackles the problem of intermediate state estimation when external localization is unavailable or in the event of a sensor outage. The proposed framework comprises two nonlinear observers based on a recently developed iteratively preconditioned gradient descent (IPG) algorithm. The observers take inputs via an IMU preintegration model; the first observer is a quaternion-based IPG observer. The output of the first observer is the input to the second, which estimates the velocity and, consequently, the position. The proposed observer is validated on a public underwater dataset and in a real-world experiment using our robot platform. The estimation is compared with an extended Kalman filter (EKF) and an invariant extended Kalman filter (InEKF). Results demonstrate that our method outperforms these baselines, with better positional accuracy and lower variance.
|
|
08:55-09:00, Paper WeAT7.6 | |
ResiVis: A Holistic Underwater Motion Planning Approach for Robust Active Perception under Uncertainties |
|
Xanthidis, Marios | SINTEF Ocean |
Skaldebø, Martin | SINTEF Ocean |
Haugaløkken, Bent | SINTEF Ocean |
Evjemo, Linn Danielsen | SINTEF Ocean AS |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Kelasidi, Eleni | NTNU |
Keywords: Marine Robotics, Planning under Uncertainty, Collision Avoidance
Abstract: Motion planning for autonomous active perception in cluttered environments remains a challenging problem, requiring real-time solutions that both maximize safety and achieve a desired behavior. In dynamic underwater environments, such as in aquaculture operations, the robots are additionally expected to deal with state and motion uncertainty and errors, dynamic and deformable obstacles, currents, and disturbances. Previous work has introduced real-time frameworks that provided safe navigation in cluttered environments, active perception in static environments, and robust navigation in uncertain dynamic environments. This paper introduces a new real-time approach called ResiVis, which leverages the best aspects of the aforementioned techniques along with a new formulation that further enhances underwater autonomy by enabling active perception of static and dynamic target objects from desired distances. The proposed method utilizes path-optimization for real-time response with constraints guaranteeing continuous collision safety, and computes paths with clearance adaptive to both the conditions of the environment and the performance of the path follower. An improved new constraint encourages observations of dynamic objects, with the planner adapting to satisfy desired observation distances and their projected future positions. ResiVis is validated with challenging simulation experiments and with hardware-in-the-loop trials in real industrial-scale aquaculture facilities.
|
|
WeAT8 |
311 |
Planning and Control for Legged Robots 1 |
Regular Session |
Chair: Gan, Zhenyu | Syracuse University |
Co-Chair: Remy, C. David | University of Stuttgart |
|
08:30-08:35, Paper WeAT8.1 | |
Energy-Optimal Asymmetrical Gait Selection for Quadrupedal Robots |
|
Alqaham, Yasser G. | Syracuse University |
Cheng, Jing | Syracuse University |
Gan, Zhenyu | Syracuse University |
Keywords: Legged Robots, Optimization and Optimal Control, Dynamics
Abstract: Symmetrical gaits, such as trotting, are commonly employed in quadrupedal robots for their simplicity and stability. However, the potential of asymmetrical gaits, such as bounding and galloping—which are prevalent in their natural counterparts at high speeds or over long distances—is less clear in the design of locomotion controllers for legged machines. In these asymmetrical gaits, the system dynamics are more complex because the front and rear leg pairs exhibit different motions, which are coupled by the rotational motion of the torso. This study systematically examines five distinct asymmetrical quadrupedal gaits on a legged robot, aiming to uncover the fundamental differences in footfall sequences and the consequent energetics across a broad range of speeds. Utilizing a full-body model of a quadrupedal robot (Unitree A1), we developed a hybrid system for each gait, incorporating the desired footfall sequence and rigid impacts. To identify the most energy-optimal gait, we applied optimal control methods, framing it as a trajectory optimization problem with specific constraints and a work-based cost of transport as an objective function. Our results show that, in the context of asymmetrical gaits, when minimizing cost of transport across the entire stride, the front leg pair primarily propels the system forward, while the rear leg pair acts more like an inverted pendulum, contributing significantly less to the energetic output. Additionally, while bounding—characterized by two aerial phases per cycle—is the most energy-optimal gait at higher speeds, the energy expenditure of gaits at speeds below 1 m/s depends heavily on the robot’s specific design.
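For readers unfamiliar with the objective, a work-based cost of transport can be computed along the lines of the hedged sketch below; the exact cost term and constraints used in the paper may differ, and the mass, sampling, and joint counts here are illustrative assumptions only:

```python
# Hedged sketch of a work-based cost of transport (CoT), one common form of the
# objective the abstract describes; the paper's exact cost function may differ.
import numpy as np

def cost_of_transport(tau, qdot, dt, mass_kg, distance_m, g=9.81):
    """tau, qdot: (T, n_joints) torque and joint-velocity trajectories."""
    power = tau * qdot                               # mechanical power per joint
    positive_work = np.sum(np.clip(power, 0.0, None)) * dt
    return positive_work / (mass_kg * g * distance_m)

# Example: 12-joint robot, a 2 s stride sampled at 1 kHz, 1.5 m travelled.
T, n = 2000, 12
rng = np.random.default_rng(0)
cot = cost_of_transport(rng.normal(0, 5, (T, n)), rng.normal(0, 2, (T, n)),
                        dt=1e-3, mass_kg=12.0, distance_m=1.5)
print(f"work-based CoT = {cot:.3f}")
```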
|
|
08:35-08:40, Paper WeAT8.2 | |
Bipedal Walking with Continuously Compliant Robotic Legs |
|
Bendfeld, Robin | University of Stuttgart |
Remy, C. David | University of Stuttgart |
Keywords: Legged Robots, Compliant Joints and Mechanisms, Motion Control
Abstract: In biomechanics and robotics, elasticity plays a crucial role in enhancing locomotion efficiency and stability. Traditional approaches in legged robots often employ series elastic actuators (SEA) with discrete rigid components, which, while effective, add weight and complexity. This paper presents an innovative alternative by integrating continuously compliant structures into the lower legs of a bipedal robot, fundamentally transforming the SEA concept. Our approach replaces traditional rigid segments with lightweight, deformable materials, reducing overall mass and simplifying the actuation design. This novel design introduces unique challenges in modeling, sensing, and control, due to the infinite dimensionality of continuously compliant elements. We address these challenges through effective approximations and control strategies. The paper details the design and modeling of the compliant leg structure, presents low-level force and kinematics controllers, and introduces a high-level posture controller with a gait scheduler. Experimental results demonstrate successful bipedal walking using this new design.
|
|
08:40-08:45, Paper WeAT8.3 | |
Optimal Torque Distribution Via Dynamic Adaptation for Quadrupedal Locomotion on Slippery Terrains |
|
Argiropoulos, Despina-Ekaterini | (a) Institute of Computer Science Foundation for Research and T |
Maravgakis, Michael | Foundation for Research and Technology - Hellas (FORTH) |
Tian, Changda | FORTH |
Papageorgiou, Dimitrios | Hellenic Mediterranean University |
Trahanias, Panos | Foundation for Research and Technology – Hellas (FORTH) |
Keywords: Legged Robots, Robust/Adaptive Control, Multi-Contact Whole-Body Motion Planning and Control
Abstract: As legged robots continue to evolve, new control methods are being developed to provide fast, robust, accurate and computationally efficient algorithms for traversing challenging environments. This paper presents a real-time adaptive locomotion controller for quadrupeds, designed to maintain stability and controllability on various surfaces, including highly slippery terrains. The proposed approach optimizes control effort distribution based on the probability of slippage by utilizing a surface-independent adaptation layer. By balancing the robot's redundant kinematic system through rank relaxation—similar to loosening constraints in optimization problems—this method demonstrates significant performance improvements. Unlike Reinforcement Learning (RL) approaches, which depend on pre-trained policies and may struggle to adapt velocity tracking control across different terrains, our method rapidly adjusts to changing conditions, as validated by extensive simulation experiments.
|
|
08:45-08:50, Paper WeAT8.4 | |
Adaptive Energy Regularization for Autonomous Gait Transition and Energy-Efficient Quadruped Locomotion |
|
Liang, Boyuan | University of California, Berkeley |
Sun, Lingfeng | University of California, Berkeley |
Zhu, Xinghao | University of California, Berkeley |
Zhang, Bike | University of California, Berkeley |
Xiong, Ziyin | Peking University |
Wang, Yixiao | University of California, Berkeley |
Li, Chenran | University of California, Berkeley |
Sreenath, Koushil | University of California, Berkeley |
Tomizuka, Masayoshi | University of California |
Keywords: Legged Robots, Reinforcement Learning, Natural Machine Motion
Abstract: In reinforcement learning for legged robot locomotion, crafting effective reward strategies is crucial. Predefined gait patterns and complex reward systems are widely used to stabilize policy training. Drawing from the natural locomotion behaviors of humans and animals, which adapt their gaits to minimize energy consumption, we investigate the impact of incorporating an energy-efficient reward term that prioritizes distance-averaged energy consumption into the reinforcement learning framework. Our findings demonstrate that this simple addition enables quadruped robots to autonomously select appropriate gaits—such as four-beat walking at lower speeds and trotting at higher speeds—without the need for explicit gait regularizations. Furthermore, we provide a guideline for tuning the weight of this energy-efficient reward, facilitating its application in real-world scenarios. The effectiveness of our approach is validated through simulations and on a real Unitree Go1 robot. This research highlights the potential of energy-centric reward functions to simplify and enhance the learning of adaptive and efficient locomotion in quadruped robots. Videos and more details are at https://sites.google.com/berkeley.edu/efficient-locomotion.
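A minimal sketch of the distance-averaged energy idea described in the abstract follows; the weights, the tracking term, and the function names are illustrative assumptions, not the authors' released reward code:

```python
# Hedged sketch: an energy reward that penalizes energy per distance travelled
# rather than per time step, which is the "distance-averaged" idea in the abstract.
import numpy as np

def energy_reward(tau, qdot, base_vel_x, w_energy=0.02, eps=0.1):
    power = np.sum(np.abs(tau * qdot))            # instantaneous actuation power
    distance_rate = max(abs(base_vel_x), eps)     # avoid dividing by near-zero speed
    return -w_energy * power / distance_rate      # energy per meter, weighted

# Combined with a velocity-tracking term inside the usual RL step reward:
def step_reward(tau, qdot, base_vel_x, cmd_vel_x):
    r_track = np.exp(-4.0 * (base_vel_x - cmd_vel_x) ** 2)
    return r_track + energy_reward(tau, qdot, base_vel_x)
```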
|
|
08:50-08:55, Paper WeAT8.5 | |
Music-Driven Legged Robots: Synchronized Walking to Rhythmic Beats |
|
Hou, Taixian | FuDan University |
Zhang, Yueqi | Fudan University |
Wei, Xiaoyi | Fudan University |
Dong, Zhiyan | Fudan University |
Yi, Jiafu | Hainan University |
Zhai, Peng | Fudan University |
Zhang, Lihua | Fudan University |
Keywords: Legged Robots, Reinforcement Learning, Biomimetics
Abstract: We address the challenge of effectively controlling the locomotion of legged robots by incorporating precise frequency and phase characteristics, which is often ignored in locomotion policies that do not account for the periodic nature of walking. We propose a hierarchical architecture that integrates a low-level phase tracker, oscillators, and a high-level phase modulator. This controller allows quadruped robots to walk in a natural manner that is synchronized with external musical rhythms. Our method generates diverse gaits across different frequencies and achieves real-time synchronization with music in the physical world. This research establishes a foundational framework for enabling real-time execution of accurate rhythmic motions in legged robots. The video and code are available at https://music-walker.github.io/.
|
|
08:55-09:00, Paper WeAT8.6 | |
Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control |
|
Lu, Chenhao | Tsinghua University |
Cheng, Xuxin | University of California, San Diego |
Li, Jialong | UCSD |
Yang, Shiqi | The Chinese University of Hong Kong, Shenzhen |
Ji, Mazeyu | UCSD |
Yuan, Chengjing | University of California, San Diego |
Yang, Ge | Massachusetts Institute of Technology |
Yi, Sha | UC San Diego |
Wang, Xiaolong | UC San Diego |
Keywords: Humanoid Robot Systems, Sensorimotor Learning, Representation Learning
Abstract: Humanoid robots require both robust lower-body locomotion and precise upper-body manipulation. While recent Reinforcement Learning (RL) approaches provide whole-body loco-manipulation policies, they lack precise manipulation with high DoF arms. In this paper, we propose decoupling upper-body control from locomotion, using inverse kinematics (IK) and motion retargeting for precise manipulation, while RL focuses on robust lower-body locomotion. We introduce PMP (Predictive Motion Priors), trained with Conditional Variational Autoencoder (CVAE) to effectively represent upper-body motions. The locomotion policy is trained and conditioned on this upper-body motion representation, ensuring that the system remains robust with both manipulation and locomotion. We show that CVAE features are crucial for stability and robustness, and significantly outperforms RL-based whole-body control in precise manipulation. With precise upper-body motion and robust lower-body locomotion control, operators can remotely control the humanoid to walk around and explore different environments, while performing diverse manipulation tasks.
|
|
WeAT9 |
312 |
Multi-Robot Planning and Navigation |
Regular Session |
Chair: Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
|
08:30-08:35, Paper WeAT9.1 | |
Distributed Safe Navigation of Multi-Agent Systems Using Control Barrier Function-Based Controllers |
|
Mestres, Pol | University of California, San Diego |
Nieto-Granda, Carlos | DEVCOM U.S. Army Research Laboratory |
Cortes, Jorge | University of California, San Diego |
Keywords: Multi-Robot Systems, Collision Avoidance, Optimization and Optimal Control
Abstract: This paper proposes a distributed controller synthesis framework for safe navigation of multi-agent systems. We leverage control barrier functions to formulate collision avoidance with obstacles and teammates as constraints on the control input for a state-dependent network optimization problem that encodes team formation and the navigation task. Our algorithmic solution is valid under general assumptions for nonlinear dynamics and state-dependent network optimization problems with convex constraints and strongly convex objectives. The resulting controller is distributed, satisfies the safety constraints at all times, and asymptotically converges to the solution of the state-dependent network optimization problem. We illustrate its performance in a team of differential-drive robots in a variety of complex environments, both in simulation and in hardware.
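The per-obstacle building block of such controllers is a CBF constraint on the input; the sketch below shows a single-integrator, single-constraint safety filter with a closed-form minimum-norm correction. The paper's method is distributed and handles a network-level objective, so this is only an illustrative fragment with assumed parameters:

```python
# Hedged sketch of the CBF safety-filter idea: keep h(x) = ||x - x_obs||^2 - r^2 >= 0
# by enforcing dh/dt + alpha * h >= 0 on a single-integrator robot.
import numpy as np

def cbf_filter(x, u_nom, x_obs, r, alpha=1.0):
    """Closed-form projection of u_nom onto {u : grad_h . u + alpha * h >= 0}."""
    h = np.dot(x - x_obs, x - x_obs) - r**2
    grad_h = 2.0 * (x - x_obs)                 # dh/dx for single-integrator dynamics
    slack = grad_h @ u_nom + alpha * h
    if slack >= 0.0:                           # nominal input is already safe
        return u_nom
    # Minimum-norm correction onto the constraint boundary.
    return u_nom - slack * grad_h / (grad_h @ grad_h)

u_safe = cbf_filter(x=np.array([0.0, 0.0]), u_nom=np.array([1.0, 0.0]),
                    x_obs=np.array([1.5, 0.0]), r=1.0)
print(u_safe)   # the commanded motion toward the obstacle is scaled back
```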
|
|
08:35-08:40, Paper WeAT9.2 | |
Hybrid Decision Making for Scalable Multi-Agent Navigation: Integrating Semantic Maps, Discrete Coordination, and Model Predictive Control |
|
de Vos, Koen | Eindhoven University of Technology |
Torta, Elena | Eindhoven University of Technology |
Bruyninckx, Herman | KU Leuven |
López Martínez, César Augusto | Eindhoven University of Technology |
van de Molengraft, Marinus Jacobus Gerardus | University of Technology Eindhoven |
Keywords: Multi-Robot Systems, Cooperating Robots, Constrained Motion Planning
Abstract: This paper presents a framework for multi-agent navigation in structured but dynamic environments, integrating three key components: a shared semantic map encoding metric and semantic environmental knowledge, a claim policy for coordinating access to areas within the environment, and a Model Predictive Controller for generating motion trajectories that respect environmental and coordination constraints. The main advantages of this approach include: (i) enforcing area occupancy constraints derived from specific task requirements; (ii) enhancing computational scalability by eliminating the need for collision avoidance constraints between robotic agents; and (iii) the ability to anticipate and avoid deadlocks between agents. The paper includes both simulations and physical experiments demonstrating the framework’s effectiveness in various representative scenarios.
|
|
08:40-08:45, Paper WeAT9.3 | |
Decentralized Nonlinear Model Predictive Control for Safe Collision Avoidance in Quadrotor Teams with Limited Detection Range |
|
Goarin, Manohari | New York University, Tandon School of Engineering |
Li, Guanrui | Worcester Polytechnic Institute |
Saviolo, Alessandro | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Distributed Robot Systems, Collision Avoidance
Abstract: Multi-quadrotor systems face significant challenges in decentralized control, particularly with safety and coordination under sensing and communication limitations. State-of-the-art methods leverage Control Barrier Functions (CBFs) to provide safety guarantees but often neglect actuation constraints and limited detection range. To address these gaps, we propose a novel decentralized Nonlinear Model Predictive Control (NMPC) that integrates Exponential CBFs (ECBFs) to enhance safety and optimality in multi-quadrotor systems. We provide both conservative and practical minimum bounds of the range that preserve the safety guarantees of the ECBFs. We validate our approach through extensive simulations with up to 10 quadrotors and 20 obstacles, as well as real-world experiments with 3 quadrotors. Results demonstrate the effectiveness of the proposed framework in realistic settings, highlighting its potential for reliable quadrotor team operations.
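For reference, an exponential CBF for a relative-degree-two output (for example, a position constraint under acceleration-level control) is typically enforced through a condition of the generic form below; the paper's exact formulation, including the detection-range bounds, is more specific than this template:

```latex
\ddot{h}(x,u) + k_1\,\dot{h}(x) + k_0\,h(x) \ge 0, \qquad k_0,\ k_1 > 0,
```

with the gains chosen so that the polynomial s^2 + k_1 s + k_0 has stable roots; each such inequality then enters the NMPC as a per-step constraint.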
|
|
08:45-08:50, Paper WeAT9.4 | |
SIGMA: Sheaf-Informed Geometric Multi-Agent Pathfinding |
|
Liao, Shuhao | Beihang University |
Xia, Weihang | Zijin Mining |
Cao, Yuhong | National University of Singapore |
Dai, Weiheng | National University of Singapore |
He, Chengyang | National University Singapore |
Wu, Wenjun | Beihang University |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Keywords: Deep Learning Methods, Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning
Abstract: The Multi-Agent Path Finding (MAPF) problem aims to determine the shortest and collision-free paths for multiple agents in a known, potentially obstacle-ridden environment. It is the core challenge for robotic deployments in large-scale logistics and transportation. Decentralized learning-based approaches have shown great potential for addressing the MAPF problem, offering more reactive and scalable solutions. However, existing learning-based MAPF methods usually rely on agents making decisions based on a limited field of view (FOV), resulting in short-sighted policies and inefficient cooperation in complex scenarios. Here, a critical challenge is to achieve consensus on potential movements between agents based on limited observations and communications. To tackle this challenge, we introduce a new framework that applies sheaf theory to decentralized deep reinforcement learning, enabling agents to learn geometric cross-dependencies between each other through local consensus and utilize them for tightly cooperative decision-making. In particular, sheaf theory provides a mathematical proof of conditions for achieving global consensus through local observation. Inspired by this, we incorporate a neural network to approximately model the consensus in latent space based on sheaf theory and train it through self-supervised learning. During the task, in addition to normal features for MAPF as in previous works, each agent distributedly reasons about a learned consensus feature, leading to efficient cooperation on pathfinding and collision avoidance. As a result, our proposed method demonstrates significant improvements over state-of-the-art learning-based MAPF planners, especially in relatively large and complex scenarios, and shows its superiority over baselines in various simulations and real-world robot experiments.
|
|
08:50-08:55, Paper WeAT9.5 | |
An Efficient NSGA-II-Based Algorithm for Multi-Robot Coverage Path Planning |
|
Foster, Ashley | University of Plymouth |
Gianni, Mario | University of Liverpool |
Aly, Amir | University of Plymouth |
Samani, Hooman | University of the Arts London |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Distributed Robot Systems
Abstract: This work presents an algorithm based on the Nondominated Sorting Genetic Algorithm II (NSGA-II) to solve multi-objective offline Multi-Robot Coverage Path Planning (MCPP) problems. The proposed algorithm embeds a donation-mutation operator and a multiple-parent crossover that generates solutions which maintain the longest path while minimizing the average path length. The algorithm also uses a library of elitism-selected high-fitness robot paths, and tournament-selected high min-max fitness paths, to construct high multi-objective fitness offspring. We evaluate the performance of our proposed algorithm against the state-of-the-art NSGA-II extended with an improved Heuristic Genetic Algorithm Crossover, and we demonstrate that for different instances of the MCPP problem, the Pareto-fronts of our proposed algorithm are not dominated by any of the points of the fronts generated by the state-of-the-art NSGA-II. A comparison has also been performed in a virtual environment simulating five drones inspecting three wind turbines. Results show that our approach exhibits a higher convergence rate for higher values of the ratio between the number of points to visit and the number of drones.
|
|
08:55-09:00, Paper WeAT9.6 | |
An Iterative Approach for Heterogeneous Multi-Agent Route Planning with Resource Transportation Uncertainty and Temporal Logic Goals |
|
Cardona, Gustavo A. | Lehigh University |
Liang, Kaier | Lehigh University |
Vasile, Cristian Ioan | Lehigh University |
Keywords: Formal Methods in Robotics and Automation, Planning, Scheduling and Coordination, Multi-Robot Systems
Abstract: This paper presents an iterative approach for heterogeneous multi-agent route planning in environments with unknown resource distributions. We focus on a team of robots with diverse capabilities tasked with executing missions specified using Capability Temporal Logic (CaTL), a formal framework built on Signal Temporal Logic to handle spatial, temporal, capability, and resource constraints. The key challenge arises from the uncertainty in the initial distribution and quantity of resources in the environment. To address this, we introduce an iterative algorithm that dynamically balances exploration and task fulfillment. Robots are guided to explore the environment, identifying resource locations and quantities while progressively refining their understanding of the resource landscape. At the same time, they aim to maximally satisfy the mission objectives based on the current information, adapting their strategies as new data is uncovered. This approach provides a robust solution for planning in dynamic, resource-constrained environments, enabling efficient coordination of heterogeneous teams even under conditions of uncertainty. Our method's effectiveness and performance are demonstrated through simulated case studies.
|
|
WeAT10 |
313 |
Multi-Robot Path Planning 1 |
Regular Session |
Chair: Akella, Srinivas | University of North Carolina at Charlotte |
Co-Chair: Delgado, Carmen | I2CAT Foundation |
|
08:30-08:35, Paper WeAT10.1 | |
Connectivity-Preserving Distributed Informative Path Planning for Mobile Robot Networks |
|
Nguyen, Thanh Binh | TAMUCC |
Nghiem, Truong Xuan | University of Central Florida |
Nguyen, Linh | Federation University Australia |
La, Hung | University of Nevada at Reno |
Nguyen, Thang | Texas A&M University-Corpus Christi |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Integrated Planning and Learning, Distributed Robot Systems
Abstract: This letter addresses the distributed informative path planning (IPP) problem for a mobile robot network to optimally explore a spatial field. Each robot is able to gather noisy environmental measurements while navigating the environment and build its own model of a spatial phenomenon using a Gaussian process and local data. The IPP optimization problem is formulated in an informative way through a multi-step prediction scheme constrained by connectivity preservation and collision avoidance. The shared hyperparameters of the local Gaussian process models are also arranged to be optimally computed in the path planning optimization problem. By using the proximal alternating direction method of multipliers, the optimization problem can be effectively solved in a distributed manner. We theoretically prove that the connectivity of the network is maintained over time whilst the solution of the optimization problem converges to a stationary point. The effectiveness of the proposed approach is verified in synthetic experiments by utilizing a real-world dataset.
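A minimal Gaussian process posterior of the kind each robot maintains locally can be sketched as below; the kernel and hyperparameters are fixed placeholders here, whereas the paper treats the shared hyperparameters as decision variables inside the distributed planning problem:

```python
# Hedged sketch of a local GP field model (RBF kernel, exact posterior); purely
# illustrative, not the paper's implementation.
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xq, noise=0.01):
    """Posterior mean and covariance of the field at query locations Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xq), rbf(Xq, Xq)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    cov = Kss - v.T @ v
    return mean, cov

X = np.random.rand(30, 2); y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(30)
mean, cov = gp_posterior(X, y, np.random.rand(5, 2))
```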
|
|
08:35-08:40, Paper WeAT10.2 | |
A Hierarchical Framework for Solving the Constrained Multiple Depot Traveling Salesman Problem |
|
Yang, Ruixiao | Massachusetts Institute of Technology |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Planning, Scheduling and Coordination, Task Planning
Abstract: The Multiple Depot Traveling Salesman Problem (MDTSP) is a variant of the NP-hard Traveling Salesman Problem (TSP) with more than one salesman to jointly visit all destinations, commonly found in task planning in multi-agent robotic systems. Traditional MDTSP overlooks practical constraints like limited battery level and inter-agent conflicts, often leading to infeasible or unsafe solutions in reality. In this work, we incorporate energy and resource consumption constraints to form the Constrained MDTSP (CMDTSP). We design a novel hierarchical framework to obtain high-quality solutions with low computational complexity. The framework decomposes a given CMDTSP instance into manageable sub-problems, each handled individually via a TSP solver and heuristic search to generate tours. The tours are then aggregated and processed through a Mixed-Integer Linear Program (MILP), which contains significantly fewer variables and constraints than the MILP for the exact CMDTSP, to form a feasible solution efficiently. We demonstrate the performance of our framework on both real-world and synthetic datasets. It reaches a mean 12.48% optimality gap and 41.7x speedup over the exact method on common instances and a 5.22% to 14.84% solution quality increase with more than 79.8x speedup over the best baseline on large instances where the exact method times out.
|
|
08:40-08:45, Paper WeAT10.3 | |
Fully Differentiable Adaptive Informative Path Planning |
|
Jakkala, Kalvik | University of North Carolina at Charlotte |
Akella, Srinivas | University of North Carolina at Charlotte |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Environment Monitoring and Management, Integrated Planning and Learning
Abstract: Autonomous robots can survey and monitor large environments. However, these robots often have limited computational and power resources, making it crucial to develop an efficient and adaptive informative path planning (IPP) algorithm. Such an algorithm must quickly adapt to environmental data to maximize the information collected while accommodating path constraints, such as distance budgets and boundary limitations. Current approaches to this problem often rely on maximizing mutual information using methods such as greedy algorithms, Bayesian optimization, and genetic algorithms. These methods can be slow and do not scale well to large or 3D environments. We present an adaptive IPP approach that is fully differentiable, significantly faster than previous methods, and scalable to 3D spaces. Our approach also supports continuous sensing robots, which collect data continuously along the entire path, by leveraging streaming sparse Gaussian processes. Benchmark results on two real-world datasets demonstrate that our approach yields solutions that are on par with or better than baseline methods while being up to two orders of magnitude faster. Additionally, we showcase our adaptive IPP approach in a 3D space using a system-on-chip embedded computer with minimal computational resources. Our code is available in the SGP-Tools Python library with a companion ROS 2 package for deployment on ArduPilot-based robots.
|
|
08:45-08:50, Paper WeAT10.4 | |
Online Informative Motion Planning for Active Information Gathering of a Non-Stationary Gaussian Process |
|
Mao, Kexiang | Shanghai Jiao Tong University |
He, Jianping | Shanghai Jiao Tong University |
Duan, Xiaoming | Shanghai Jiao Tong University |
Keywords: Environment Monitoring and Management, Motion and Path Planning, Reactive and Sensor-Based Planning
Abstract: Information gathering focuses on designing strategies for a robot to collect data about a physical process, aiming for accurate field reconstruction. While many recent methods have been proposed to address this problem, they often assume the model of the physical process is a priori known and stationary—assumptions that rarely hold in practice. This paper presents a novel informative motion planning approach for online information gathering of a non-stationary Gaussian process. Our approach comprises two key components: an informative path planner that explores the physical field and an adaptive velocity planner that adjusts the robot's velocity profile exploiting the field's spatial variability. Additionally, we propose a path smoothing and tracking strategy to ensure continuous robot motion. Extensive simulations on a bathymetric mapping task demonstrate the effectiveness of our approach, showing superior performance in reconstructing non-stationary physical fields compared to several baseline methods.
|
|
08:50-08:55, Paper WeAT10.5 | |
REACT: Multi Robot Energy-Aware Orchestrator for Indoor Search and Rescue Critical Tasks |
|
Maresca, Fabio | NEC Laboratories Europe GmbH |
Romero, Arnau | I2CAT Foundation |
Delgado, Carmen | I2CAT Foundation |
Sciancalepore, Vincenzo | NEC Laboratories Europe GmbH |
Paradells, Josep | Universitat Politecnica De Catalunya |
Costa-Perez, Xavier | NEC Laboratories Europe |
Keywords: Search and Rescue Robots, Path Planning for Multiple Mobile Robots or Agents, Robotics in Under-Resourced Settings
Abstract: Smart factories enhance production efficiency and sustainability, but emergencies like human errors, machinery failures and natural disasters pose significant risks. In critical situations, such as fires or earthquakes, collaborative robots can assist first-responders by entering damaged buildings and locating missing persons, mitigating potential losses. Unlike previous solutions that overlook the critical aspect of energy management, in this paper we propose REACT, a smart energy-aware orchestrator that optimizes the exploration phase, ensuring prolonged operational time and effective area coverage. Our solution leverages a fleet of collaborative robots equipped with advanced sensors and communication capabilities to explore and navigate unknown indoor environments, such as smart factories affected by fires or earthquakes, with a high density of obstacles. By leveraging real-time data exchange and cooperative algorithms, the robots dynamically adjust their paths, minimize redundant movements and reduce energy consumption. Extensive simulations confirm that our approach significantly improves the efficiency and reliability of search and rescue missions in complex indoor environments, improving the exploration rate by 10% over existing methods and reaching a map coverage of 97% under time-critical operations, and up to nearly 100% under a relaxed time constraint.
|
|
08:55-09:00, Paper WeAT10.6 | |
Multi-Agent Ergodic Exploration under Smoke-Based Time-Varying Visibility Constraints |
|
Wittemyer, Elena | Yale University |
Rao, Ananya | Carnegie Mellon University |
Abraham, Ian | Yale University |
Choset, Howie | Carnegie Mellon University |
Keywords: Aerial Systems: Perception and Autonomy, Vision-Based Navigation, Path Planning for Multiple Mobile Robots or Agents
Abstract: In this work, we consider the problem of multi-agent informative path planning (IPP) for robots whose sensor visibility evolves over time as a consequence of a time-varying natural phenomenon. We leverage ergodic trajectory optimization (ETO), which generates paths such that the amount of time an agent spends in an area is proportional to the expected information in that area. We focus specifically on the problem of multi-agent drone search of a wildfire, where we use the time-varying environmental process of smoke diffusion to construct a sensor visibility model. This sensor visibility model is used to repeatedly calculate an expected information distribution (EID) to be used in the ETO algorithm. Our experiments show that our exploration method achieves improved information gathering over both baseline search methods and naive ergodic search formulations.
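Ergodic trajectory optimization is commonly posed with a spectral ergodic metric of the form below (a standard formulation, not necessarily the exact variant used in this work), where the expected information distribution (EID) enters through its basis coefficients:

```latex
\mathcal{E}\big(x(\cdot)\big)=\sum_{k}\Lambda_k\,\big(c_k-\phi_k\big)^2,\qquad
c_k=\frac{1}{T}\int_0^{T} F_k\big(x(t)\big)\,dt,\qquad
\phi_k=\int_{\mathcal{W}} \mathrm{EID}(w)\,F_k(w)\,dw,
```

with F_k a Fourier basis over the workspace W and Lambda_k decaying weights; minimizing this metric drives the time each agent spends in a region toward the expected information contained in that region.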
|
|
WeAT11 |
314 |
Safe Control 1 |
Regular Session |
Chair: Francis, Jonathan | Bosch Center for Artificial Intelligence |
|
08:30-08:35, Paper WeAT11.1 | |
DiffTune-MPC: Closed-Loop Learning for Model Predictive Control |
|
Tao, Ran | University of Illinois Urbana-Champaign |
Cheng, Sheng | University of Illinois Urbana-Champaign |
Wang, Xiaofeng | University of South Carolina |
Wang, Shenlong | University of Illinois at Urbana-Champaign |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Keywords: Optimization and Optimal Control, Machine Learning for Robot Control, Model Learning for Control
Abstract: Model predictive control (MPC) has been applied to many platforms in robotics and autonomous systems for its capability to predict a system's future behavior while incorporating constraints that a system may have. To enhance the performance of a system with an MPC controller, one can manually tune the MPC's cost function. However, it can be challenging due to the possibly high dimension of the parameter space as well as the potential difference between the open-loop cost function in MPC and the overall closed-loop performance metric function. This paper presents DiffTune-MPC, a novel learning method, to learn the cost function of an MPC in a closed-loop manner. The proposed framework is compatible with the scenario where the time interval for performance evaluation and MPC's planning horizon have different lengths. We show the auxiliary problem whose solution admits the analytical gradients of MPC and discuss its variations in different MPC settings, including nonlinear MPCs that are solved using sequential quadratic programming. Simulation results demonstrate the learning capability of DiffTune-MPC and the generalization capability of the learned MPC parameters.
|
|
08:35-08:40, Paper WeAT11.2 | |
Combined Modal Robust Cascade Control for Wheeled Self-Reconfigurable Robots under Drive Failure and Safety Threat |
|
Jiang, Tao | Chongqing University |
Wang, Jianxiang | Chongqing University |
Zheng, Zhi | Chongqing University |
Mo, Rongqin | Chongqing University |
Sun, Yizhuo | Harbin Institute of Technology |
Keywords: Robot Safety, Motion Control, Robust/Adaptive Control
Abstract: Wheeled self-reconfigurable robots (WSRRs) are a novel class of multi-robot systems with flexible configurations and task adaptability, offering broad application prospects in unstructured task environments. Based on nonholonomic constraints and the Lagrangian method, this paper establishes the combined modal kinematics and dynamics of WSRRs with arbitrary reconfiguration scales. At the kinematic level, a smooth obstacle-avoidance strategy based on safety geofencing is designed under the nonholonomic constraints to ensure safety. At the dynamic level, an adaptive fault-tolerant mechanism is introduced to guarantee reasonable torque allocation and avoid degradation of tracking performance. In addition, an improved extended state observer (IESO) is presented, which suppresses high-frequency oscillations caused by measurement noise and the peaking phenomenon due to initial observer error, achieving robust velocity tracking control under unknown lumped disturbances.
|
|
08:40-08:45, Paper WeAT11.3 | |
CaDRE: Controllable and Diverse Generation of Safety-Critical Driving Scenarios Using Real-World Trajectories |
|
Huang, Peide | Apple Inc |
Ding, Wenhao | Carnegie Mellon University |
Stoler, Benjamin | Carnegie Mellon University |
Francis, Jonathan | Bosch Center for Artificial Intelligence |
Chen, Bingqing | Bosch Center for AI |
Zhao, Ding | Carnegie Mellon University |
Keywords: Robot Safety, Intelligent Transportation Systems, Autonomous Vehicle Navigation
Abstract: Simulation is an indispensable tool in the development and testing of autonomous vehicles (AVs), offering an efficient and safe alternative to road testing. An outstanding challenge with simulation-based testing is the generation of safety-critical scenarios, which are essential to ensure that AVs can handle rare but potentially fatal situations. This paper addresses this challenge by introducing a novel framework, CaDRE, to generate realistic, diverse, and controllable safety-critical scenarios. Our approach optimizes for both the quality and diversity of scenarios by employing a unique formulation and algorithm that integrates real-world scenarios, domain knowledge, and black-box optimization. We validate the effectiveness of our framework through extensive testing in three representative types of traffic scenarios. The results demonstrate superior performance in generating diverse and high-quality scenarios with greater sample efficiency than existing reinforcement learning (RL) and sampling-based methods.
|
|
08:45-08:50, Paper WeAT11.4 | |
Certificated Actor-Critic: Hierarchical Reinforcement Learning with Control Barrier Functions for Safe Navigation |
|
Xie, Junjun | Harbin Institute of Technology, Shenzhen, China |
Zhao, Shuhao | School of Mechanical Engineering and Automation Harbin Institute |
Hu, Liang | Harbin Institute of Technology, Shenzhen |
Gao, Huijun | Harbin Institute of Technology |
Keywords: Robot Safety, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Control Barrier Functions (CBFs) have emerged as a prominent approach to designing safe navigation systems for robots. Despite their popularity, current CBF-based methods exhibit some limitations: optimization-based safe control techniques tend to be either myopic or computationally intensive, and they rely on simplified system models; conversely, the learning-based methods suffer from the lack of quantitative indication in terms of navigation performance and safety. In this paper, we present a new model-free reinforcement learning algorithm called Certificated Actor-Critic (CAC), which introduces a hierarchical reinforcement learning framework and well-defined reward functions derived from CBFs. We carry out theoretical analysis and provide proofs for our algorithm, and propose several improvements to the algorithm's implementation. Our analysis is validated by two simulation experiments, showing the effectiveness of our proposed CAC algorithm.
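One simple way to derive a safety reward from a CBF, in the spirit of what the abstract describes (the paper's exact reward and hierarchy may differ), is to penalize violations of the discrete-time CBF condition:

```python
# Hedged sketch: a safety reward built from a CBF h(x), penalizing violations of the
# discrete-time condition h_{t+1} >= (1 - alpha) * h_t. Weights are assumptions.
def cbf_reward(h_now, h_next, alpha=0.9):
    margin = h_next - (1.0 - alpha) * h_now
    return min(0.0, margin)          # zero when the condition holds, negative otherwise

def total_reward(r_task, h_now, h_next, w_safe=5.0):
    return r_task + w_safe * cbf_reward(h_now, h_next)
```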
|
|
08:50-08:55, Paper WeAT11.5 | |
Exact Imposition of Safety Boundary Conditions in Neural Reachable Tubes |
|
Singh, Aditya | Indian Institute of Technology, Patna |
Feng, Zeyuan | Stanford University |
Bansal, Somil | Stanford University |
Keywords: Robot Safety, Machine Learning for Robot Control
Abstract: Hamilton-Jacobi (HJ) reachability analysis is a widely adopted verification tool to provide safety and performance guarantees for autonomous systems. However, it involves solving a partial differential equation (PDE) to compute a safety value function, whose computational and memory complexity scales exponentially with the state dimension, making its direct application to large-scale systems intractable. To overcome these challenges, DeepReach, a recently proposed learning-based approach, approximates high-dimensional reachable tubes using neural networks (NNs). While shown to be effective, the accuracy of the learned solution decreases with system complexity. One of the reasons for this degradation is a soft imposition of safety constraints during the learning process, which correspond to the boundary conditions of the PDE, resulting in inaccurate value functions. In this work, we propose ExactBC, a variant of DeepReach that imposes safety constraints exactly during the learning process by restructuring the overall value function as a weighted sum of the boundary condition and the NN output. Moreover, the proposed variant no longer needs a boundary loss term during the training process, thus eliminating the need to balance different loss terms. We demonstrate the efficacy of the proposed approach in significantly improving the accuracy of the learned value function for four challenging reachability tasks: a rimless wheel system with state resets, collision avoidance in a cluttered environment, autonomous rocket landing, and multi-aircraft collision avoidance.
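The exact-imposition idea can be illustrated with the hedged sketch below: the value network only ever contributes through a factor that vanishes at the terminal time, so the boundary condition holds by construction. The specific linear weighting used here is an assumption, not necessarily the paper's weighted sum:

```python
# Hedged sketch of exact boundary-condition imposition for a learned value function:
# V(x, t) = l(x) + (t - T) * NN(x, t), so V(x, T) = l(x) exactly, with no boundary loss.
import torch
import torch.nn as nn

class ExactBCValue(nn.Module):
    def __init__(self, state_dim, horizon_T, hidden=64):
        super().__init__()
        self.T = horizon_T
        self.net = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, t, l_of_x):
        """x: (B, n) states, t: (B, 1) times, l_of_x: (B, 1) boundary values l(x)."""
        correction = self.net(torch.cat([x, t], dim=-1))
        return l_of_x + (t - self.T) * correction   # equals l(x) exactly at t = T

model = ExactBCValue(state_dim=3, horizon_T=1.0)
x = torch.randn(8, 3); t = torch.full((8, 1), 1.0)
assert torch.allclose(model(x, t, x.norm(dim=-1, keepdim=True)),
                      x.norm(dim=-1, keepdim=True))
```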
|
|
08:55-09:00, Paper WeAT11.6 | |
RelAIBotiX: Reliability Assessment for AI-Controlled Robotic Systems |
|
Grimmeisen, Philipp | University of Stuttgart |
Golwalkar, Rucha | University of Lübeck |
Sautter, Friedrich | IAS, Uni Stuttgart |
Morozov, Andrey | University of Stuttgart |
Keywords: Robot Safety, AI-Based Methods, Probability and Statistical Methods
Abstract: AI-controlled robotic systems can introduce significant risks to both humans and the environment. Traditional reliability assessment methods fall short in addressing the complexities of these systems, particularly when dealing with black-box or dynamically changing control policies. The traditional approaches are applied manually and do not consider frequent software updates. In this paper, we present RelAIBotiX, a new methodology that enables dynamic and continuous reliability assessment, specifically tailored for robotic systems controlled by AI algorithms. RelAIBotiX is a dynamic reliability assessment framework that combines four methods: (i) Skill Detection that automatically identifies executed skills using deep learning techniques, (ii) Behavioral Analysis that creates an operational profile of the robotic system containing information about the skill execution sequence, active components for each skill, and their utilization intensity, which influences their failure rates, (iii) Reliability Model Generation that automatically transforms the operational profile and reliability data of robotic hardware components into quantitative hybrid reliability models, and (iv) Reliability Model Solver for the numerical evaluation of the generated reliability models. Our evaluation included computing the reliability of the system, the probability of failure of individual skills, and component sensitivity analysis. We validated the applicability of the proposed framework in five simulated and real-world setups.
|
|
WeAT12 |
315 |
Human-Robot Interaction 3 |
Regular Session |
Chair: Fitter, Naomi T. | Oregon State University |
Co-Chair: Yuan, Wenzhen | University of Illinois |
|
08:30-08:35, Paper WeAT12.1 | |
Adaptive Emotional Expression in Social Robots: A Multimodal Approach to Dynamic Emotion Modeling |
|
Park, Haeun | Ulsan National Institute of Science and Technology |
Lee, Jiyeon | Ulsan National Institute of Science and Technology |
Lee, Hui Sung | UNIST (Ulsan National Institute of Science and Technology) |
Keywords: Emotional Robotics, Gesture, Posture and Facial Expressions, Robot Companions
Abstract: Social robots have been extensively studied in recent decades, with many researchers exploring the use of modalities such as facial expressions to achieve more natural emotions in robots. Various methods have been attempted to generate and express robot emotions, including computational models that define an affect space and show dynamic emotion changes. However, the implementation of multimodal expression in previous models is ambiguous, and the generation of emotions in response to stimuli relies on heuristic methods. In this paper, we present a framework that enables robots to naturally express their emotions in a multimodal way, where the emotion can change over time based on the given stimulus values. By representing the robot’s emotion as a position in an affect space of a computational emotion model, we consider the given stimulus values as driving forces that can shift the emotion position dynamically. In order to examine the feasibility of our proposed method, a mobile robot prototype was implemented that can recognize touch and express different emotions with facial expressions and movements. The experiment demonstrated that the emotion elicited by a given stimulus is contingent upon the robot’s previous state, thereby imparting the impression that the robot possesses a distinctive emotion model. Furthermore, the Godspeed survey results indicated that our model was rated significantly higher than the baseline, which did not include a computational emotion model, in terms of anthropomorphism, animacy, and perceived intelligence. Notably, the unpredictability of emotion switching contributed to a perception of greater lifelikeness, which in turn enhanced the overall interaction experience.
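A minimal sketch of the dynamic-affect idea follows: the emotion is a point in a valence-arousal space, stimuli act as driving forces, and the state decays toward a baseline, so the response to a given touch depends on the previous emotional state. All parameter values and the decay form are illustrative assumptions:

```python
# Hedged sketch of a dynamic affect-space update; not the paper's emotion model.
import numpy as np

def emotion_step(e, stimulus_force, dt=0.1, decay=0.5, baseline=np.zeros(2)):
    """e, stimulus_force: 2D vectors in valence-arousal coordinates, kept in [-1, 1]^2."""
    e_next = e + dt * (stimulus_force - decay * (e - baseline))
    return np.clip(e_next, -1.0, 1.0)

# A gentle stroke (positive valence, mild arousal) applied repeatedly: the emotion
# drifts toward a content state, and its trajectory depends on where it started.
e = np.array([-0.4, 0.6])                  # previously annoyed / aroused
for _ in range(20):
    e = emotion_step(e, stimulus_force=np.array([0.8, -0.2]))
print(e)
```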
|
|
08:35-08:40, Paper WeAT12.2 | |
CAS: Fusing DNN Optimization & Adaptive Sensing for Energy-Efficient Multi-Modal Inference |
|
Weerakoon Mudiyanselage, Dulanga Kaveesha Weerakoon | Singapore-MIT Alliance for Research & Technology |
Subbaraju, Vigneshwaran | Agency for Science Technology and Research (A*STAR) |
Lim, Joo Hwee | I2R A*STAR |
Misra, Archan | Singapore Management University |
Keywords: Human-Robot Collaboration, Multi-Modal Perception for HRI, Embedded Systems for Robotic and Automation
Abstract: Intelligent virtual agents are used to accomplish complex multi-modal tasks such as human instruction comprehension in mixed-reality environments by increasingly adopting richer, energy-intensive sensors and processing pipelines. In such applications, the context for activating sensors and processing blocks required to accomplish a given task instance is usually manifested via multiple sensing modes. Based on this observation, we introduce a novel Commit-and-Switch (CAS) paradigm that simultaneously seeks to reduce both sensing and processing energy. In CAS, we first commit to a low-energy computational pipeline with a subset of available sensors. Then, the task context estimated by this pipeline is used to optionally switch to another energy-intensive DNN pipeline and activate additional sensors. We demonstrate how CAS’s paradigm of interweaving DNN computation and sensor triggering can be instantiated principally by constructing multi-head DNN models and jointly optimizing the accuracy and sensing costs associated with different heads. We exemplify CAS via the development of the RealGIN-MH model for multi-modal target acquisition tasks, a core enabler of immersive human-agent interaction. RealGIN-MH achieves 12.9x reduction in energy overheads, while outperforming baseline dynamic model optimization approaches.
|
|
08:40-08:45, Paper WeAT12.3 | |
"Oh! It's Fun Chatting with You!" a Humor-Aware Social Robot Chat Framework |
|
Zhang, Heng | ENSTA Paris, Institut Polytechnique De Paris |
Saood, Adnan | ENSTA Paris - Institute Polytechnique De Paris |
García Cárdenas, Juan José | ENSTA - Institute Polytechinique De Paris |
Hei, Xiaoxuan | ENSTA Paris, Institut Polytechnique De Paris |
Tapus, Adriana | ENSTA Paris, Institut Polytechnique De Paris |
Keywords: Social HRI, Physical Human-Robot Interaction
Abstract: Humor is a key element in human interactions, essential for building connections and rapport. To enhance human-robot communication, we developed a humor-aware chat framework that enables robots to deliver contextually appropriate humor. This framework takes into account the interaction environment, and user’s profile as well as emotional state. Two GPT models are used to generate responses. The initial one, named sensor-GPT, processes contextual data from the sensor along with the user’s response and conversation history to create prompts for the second one, chat-GPT. These prompts can guide the model on how to integrate appropriate humor elements into the conversation, ensuring that the dialogue is both contextually relevant and humorous. Our experiment compared the effectiveness of humor expression between our framework and the GPT-4o model. The results demonstrate that robots using our framework significantly outperform those using GPT-4o in humor expression, extending conversations, and improving overall interaction quality.
|
|
08:45-08:50, Paper WeAT12.4 | |
Social Gesture Recognition in SpHRI: Leveraging Fabric-Based Tactile Sensing on Humanoid Robots |
|
Crowder, Dakarai | University of Illinois Urbana Champaign |
Vandyck, Kojo Egyir | University of Illinois Urbana-Champaign |
Sun, Xiping | University of Illinois Urbana-Champaign Champaign, IL ‧ Pu |
McCann, James | Carnegie Mellon University |
Yuan, Wenzhen | University of Illinois |
Keywords: Physical Human-Robot Interaction, Touch in HRI
Abstract: Humans are able to convey different messages using only touch. Equipping robots with the ability to understand social touch adds another modality in which humans and robots can communicate. In this paper, we present a social gesture recognition system using a fabric-based, large-scale tactile sensor integrated onto the arms of a humanoid robot. We built a social gesture dataset using multiple participants and extracted temporal features for classification. By collecting real-world data on a humanoid robot, our system provides valuable insights into human-robot social touch, further advancing the development of spHRI systems for more natural and effective communication.
|
|
08:50-08:55, Paper WeAT12.5 | |
Seeing Eye to Eye: Design and Evaluation of a Custom Expressive Eye Display Module for the Stretch Mobile Manipulator |
|
Morales Mayoral, Rafael | Oregon State University |
Buchmeier, Sean | Oregon State University |
Mockel, Stayce | Oregon State University |
Chavez, Courtney J. | Oregon State University |
Fitter, Naomi T. | Oregon State University |
Keywords: Gesture, Posture and Facial Expressions, Intention Recognition, Human-Robot Collaboration
Abstract: Mobile manipulators - robots with a moving base and an arm for grasping objects - are becoming more common in human-populated environments, such as hospitals, warehouses, and even homes. Yet most mobile manipulators lack clear ways to communicate intent to human interlocutors in a continuous, socially acceptable, and easy-to-interpret way. One possible solution for improving mobile manipulator communication is the addition of expressive eyes. This paper presents the design and evaluation of a custom expressive LED eye module for mobile manipulators, which can display both gaze and emotional expressions. Our evaluation study (N = 32) involved a mock teamwork task alongside a Hello Robot Stretch RE2 mobile manipulator with the custom LED eye module. The results showed that both gaze and emotional expressions supported better participant performance in the task and more feelings of social closeness. Emotional eye expressions also yielded higher ratings of robot social warmth and competence. This work can inform mobile manipulator design for smoother integration into human-populated spaces.
|
|
08:55-09:00, Paper WeAT12.6 | |
UGotMe: An Embodied System for Affective Human-Robot Interaction |
|
Li, Peizhen | Macquarie University |
Cao, Longbing | Macquarie University |
Wu, Xiao-Ming | Sun Yat-Sen University |
Yu, Xiaohan | Macquarie University |
Runze, Yang | Macquarie University |
Keywords: Social HRI, Gesture, Posture and Facial Expressions, Emotional Robotics
Abstract: Equipping humanoid robots with the capability to understand emotional states of human interactants and express emotions appropriately according to situations is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the visual observation of the robot, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the field of view of the robot, hinders the models from extracting emotional cues from vision inputs. Second, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/
|
|
WeAT13 |
316 |
Soft Robotic Grasping 1 |
Regular Session |
Chair: Ichnowski, Jeffrey | Carnegie Mellon University |
Co-Chair: Stewart-Height, Abriana | Massachusetts Institute of Technology |
|
08:30-08:35, Paper WeAT13.1 | |
SCU-Hand: Soft Conical Universal Robotic Hand for Scooping Granular Media from Containers of Various Sizes |
|
Takahashi, Tomoya | OMRON SINIC X Corporation |
Beltran-Hernandez, Cristian Camilo | OMRON SINIC X Corporation |
Kuroda, Yuki | OMRON SINIC X Corporation |
Tanaka, Kazutoshi | OMRON SINIC X Corporation |
Hamaya, Masashi | OMRON SINIC X Corporation |
Ushiku, Yoshitaka | OMRON SINIC X Corporation |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Robotics and Automation in Life Sciences
Abstract: Automating small-scale experiments in materials science presents challenges due to the heterogeneous nature of experimental setups. This study introduces the SCU-Hand (Soft Conical Universal Robot Hand), a novel end-effector designed to automate the task of scooping powdered samples from various container sizes using a robotic arm. The SCU-Hand employs a flexible, conical structure that adapts to different container geometries through deformation, maintaining consistent contact without complex force sensing or machine learning-based control methods. Its reconfigurable mechanism allows for size adjustment, enabling efficient scooping from diverse container types. By combining soft robotics principles with a sheet-morphing design, our end-effector achieves high flexibility while retaining the necessary rigidity for effective powder manipulation. We detail the design principles, fabrication process, and experimental validation of the SCU-Hand. Experimental validation showed that the scooping capacity is about 20% higher than that of a commercial tool, with a scooping performance of more than 95% for containers of sizes between 67 mm and 110 mm. This research contributes to laboratory automation by offering a cost-effective, easily implementable solution for automating tasks such as materials synthesis and characterization processes.
|
|
08:35-08:40, Paper WeAT13.2 | |
VSB - Variable Stiffness Based on Bowden Cables: A Simple Mechanism for Soft Robotic Hands |
|
Puhlmann, Steffen | TU Berlin |
Albu-Schäffer, Alin | DLR - German Aerospace Center |
Höppner, Hannes | Berliner Hochschule Für Technik, BHT |
Keywords: Compliant Joints and Mechanisms, Multifingered Hands
Abstract: Soft robotic hands compensate for uncertainty in perception and actuation by leveraging passive deformation in their intrinsically compliant hardware, facilitating robust and dexterous interactions with their environment. The ability to adjust the level of compliance during operation has the potential to further improve the performance of these hands by enabling novel interaction strategies. However, achieving variable stiffness mechanically typically requires significant engineering complexity, making these systems difficult to manufacture, prone to error, and expensive. We present a novel, very simple mechanism for achieving variable stiffness. This mechanism employs tendon-driven antagonistic actuation, with Bowden cables connecting elastic elements to servomotors. It supports compact actuator designs, while the Bowden cables facilitate flexible component placement within a robotic system. Following our approach, variable stiffness actuators can be easily manufactured at low cost from readily available materials. Despite its simplicity, we demonstrate that our mechanism provides consistent and precise control over stiffness levels and contact torques, showcasing its potential for a broad range of applications in soft robotic systems.
|
|
08:40-08:45, Paper WeAT13.3 | |
Design and Experimental Validation of Woodwork-Inspired Soft Pneumatic Grippers |
|
Stewart-Height, Abriana | Massachusetts Institute of Technology |
Bolli, Roberto | MIT |
Kamienski, Emily | Massachusetts Institute of Technology |
Asada, Harry | MIT |
Keywords: Soft Robot Applications, Physical Human-Robot Interaction, Grippers and Other End-Effectors
Abstract: This paper presents a novel design concept of a pair of soft gripper hands that can establish a secure connection between them for bearing a large load with a low air pressure. The design was inspired by dovetail joints in carpentry that enable a tight, strong connection between two pieces of wood. We propose to mimic the dovetail joint mechanism by using soft robotic fingers that interlace to each other for secure connection. The work was motivated by the need for securing a connection between two soft robotic arms for holding a balance-impaired older adult in case of losing balance. First, the design principle of dovetail-like secure soft finger connection is presented, and its potential application to a portable fall prevention system is described. Details of the dovetail soft finger design, its rapid inflation method, and other implementation issues are then discussed. Through experiments of a proof-of-concept prototype, it is validated that the dovetail soft fingers can bear at least 18 kg of load with only 52 kPa of air chamber pressure filled in 250 ms of charging time. At the end, the proposed method is compared to alternative methods using a Pugh chart.
|
|
08:45-08:50, Paper WeAT13.4 | |
A Variable Stiffness and Transformable Entanglement Soft Robotic Gripper |
|
Zhang, Huayu | The Chinese University of Hong Kong |
Pan, Tianle Flippy | The Chinese University of Hong Kong |
Zhou, Jianshu | University of California, Berkeley |
Liang, Boyuan | University of California, Berkeley |
Shu, Jing | The Chinese University of Hong Kong |
Zhu, Puchen | The Chinese University of Hong Kong |
An, Jiajun | The Chinese University of Hong Kong |
Liu, Yunhui | Chinese University of Hong Kong |
Ma, Xin | Chinese University of Hong Kong |
Keywords: Soft Robot Applications, Grippers and Other End-Effectors, Grasping
Abstract: For objects with complex topological and geometrical features, stochastic topological grasping can be executed without the necessity for feedback or precise planning. However, this grasping method has two significant limitations. First, the technique’s effectiveness is reduced when interacting with topologically and geometrically simple objects like spheres, cubes, and cylinders, due to the inherent variability in grasping patterns. Additionally, the method’s low stiffness restricts its ability to securely handle heavier objects. To address these challenges, this paper proposes an entanglement soft robotic gripper with variable stiffness and two transformable grasping modes (entanglement and clamping modes). The gripper contains three filaments, which can enhance the stiffness through the mechanism of layer jamming. Furthermore, the gripper can be switched between the entanglement mode and the clamping mode by adjusting the working length of the filaments. A grasping performance comparison with and without variable stiffness was carried out, and the results indicated that the implementation of variable stiffness led to a 149% increase in payload weight. Through experimental validation, we successfully employed the gripper in variable stiffness and transformed modes to grasp items with various shapes and weights. Demonstrations of grasping heavier objects and of transforming between the two grasping modes were also conducted to showcase the adaptability and versatility of the gripper.
|
|
08:50-08:55, Paper WeAT13.5 | |
Soft Robotic Dynamic In-Hand Pen Spinning |
|
Yao, Yunchao | Carnegie Mellon University |
Yoo, Uksang | Carnegie Mellon University |
Oh, Jean | Carnegie Mellon University |
Atkeson, Christopher | CMU |
Ichnowski, Jeffrey | Carnegie Mellon University |
Keywords: In-Hand Manipulation, Modeling, Control, and Learning for Soft Robots, Dexterous Manipulation
Abstract: Dynamic in-hand manipulation remains a challenging task for soft robotic systems, which have demonstrated advantages in safe compliant interactions but struggle with high-speed dynamic tasks. In this work, we present SoftSpin, a system for dynamic pen spinning using a soft and compliant robotic hand. Unlike previous works that rely on quasi-static actions and precise object models, the proposed system learns to spin a pen through trial-and-error using only real-world data, without requiring explicit prior knowledge of the pen’s physical attributes. With self-labeled trials sampled from the real world, the system discovers the set of pen grasping and spinning primitive parameters that enables a soft hand to spin the pen robustly and reliably. After 130 sampled actions, SoftSpin achieves a 100% success rate across three pens with different weights and weight distributions, demonstrating the system’s generalizability and robustness to changes in object properties. The results highlight the potential for soft robotic end-effectors to perform dynamic tasks, including rapid in-hand manipulation. We also demonstrate that SoftSpin generalizes to spinning tools with different shapes and weights, such as a brush and a screwdriver, which we spin with 10/10 and 5/10 success rates, respectively. Videos, data, and code are available at https://soft-spin.github.io
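For illustration, the self-labeled trial-and-error loop described above can be pictured as a sampling search over primitive parameters. The sketch below is schematic only: the parameter dimensionality, the sampler, and the score returned by the hypothetical execute_spin_trial stand-in are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def execute_spin_trial(params):
        # Stand-in for executing one grasp-and-spin primitive on hardware
        # and self-labeling its outcome (e.g., from onboard vision).
        return float(-np.sum((params - 0.6) ** 2))  # toy score

    best_params, best_score = None, -np.inf
    for trial in range(130):                      # the paper's trial budget
        params = rng.uniform(0.0, 1.0, size=4)    # grasp + spin parameters
        score = execute_spin_trial(params)
        if score > best_score:
            best_params, best_score = params, score
    print(best_params)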
|
|
08:55-09:00, Paper WeAT13.6 | |
Kinetostatics and Retention Force Analysis of Soft Robot Grippers with External Tendon Routing |
|
Gunderman, Anthony | University of Arkansas |
Wang, Yifan | Georgia Institute of Technology |
Gunderman, Benjamin | University of Arkansas |
Qiu, Alex | Georgia Institute of Technology |
Azizkhani, Milad | Georgia Institute of Technology |
Sommer, Joseph | Georgia Institute of Technology |
Chen, Yue | Georgia Institute of Technology |
Keywords: Soft Robot Applications, Modeling, Control, and Learning for Soft Robots, Grippers and Other End-Effectors
Abstract: Soft robots (SR) are a class of continuum robots that enable safe human interaction with task versatility beyond rigid robots. This has resulted in their rapid adoption in a number of applications that require manipulation of delicate and irregular objects. Despite their advantages, SR grippers typically require case-specific experimental characterization for shape and gripper retention force estimation. This letter presents a kinetostatic modeling approach based on strain energy minimization subject to mechanics and geometric constraints for shape estimation of SR grippers with external tendon routing (ETR), including those with composite structures. Additionally, Castigliano's First Theorem is used to estimate the retention force of the gripper. These models are evaluated across four different ETR SR grippers. The mechanics model predicted the fingertip position and orientation with an accuracy of 1.06±0.62 mm (1.79%±1.05% of length) and 3.58°±2.82° with respect to tendon force and 0.72±0.45 mm (1.22%±0.76% of length) and 2.86°±2.11° with respect to tendon retraction. The retention force of the gripper was predicted with an average error of 0.20±0.12 N.
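As background for the retention-force estimate mentioned above, Castigliano's First Theorem (stated here in its general textbook form, not the paper's specific derivation) relates the strain energy U of an elastic structure, expressed as a function of generalized displacements q_i, to the corresponding generalized forces:

    Q_i = \frac{\partial U}{\partial q_i}

Applied to the gripper, the retention force follows from differentiating the stored strain energy with respect to the displacement of the grasped object along the pull-out direction.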
|
|
WeAT14 |
402 |
Teleoperation and Human-Robot Interaction |
Regular Session |
|
08:30-08:35, Paper WeAT14.1 | |
Ego-A3: Adaptive Fusion-Based Disentangled Transformer for Egocentric Action Anticipation |
|
Kim, Min Hyuk | Chonnam National University |
Jung, JongWon | CHONNAM University |
Lee, Eungi | Chonnam National University |
Yoo, Seok Bong | Chonnam National University |
Keywords: Computer Vision for Automation, Deep Learning for Visual Perception, Wearable Robotics
Abstract: Recently, egocentric action anticipation for wearable robotics cameras has gained considerable attention due to its capability to analyze nouns and verbs from a first-person view. However, this field encounters challenges due to various uncertainties, such as action-irrelevant information and semantically fused representations of verbs and nouns. To overcome these issues, we introduce Ego-A3, designed to improve the robustness and reliability of egocentric action anticipation systems. Ego-A3 adaptively extracts action-relevant data to efficiently utilize additional information beyond visual data. Additionally, Ego-A3 produces effective disentangled representations for verbs and nouns by employing learnable verb and noun queries. Experiments on the EpicKitchens-100 and EGTEA Gaze+ datasets demonstrate that Ego-A3 outperforms existing methods in top-1 accuracy and mean top-5 recall. Our code is publicly available at https://github.com/alsgur0720/egocentric_anticipation.
|
|
08:35-08:40, Paper WeAT14.2 | |
A New Variable-Gain Sliding Mode Filter and Its Application to Velocity Filtering |
|
Aung, Myo Thant Sin | Yangon Technological University, Myanmar |
Kikuuwe, Ryo | Hiroshima University |
Paing, Soe Lin | North Carolina State University |
Yang, Jun | National University of Singapore |
Yu, Haoyong | National University of Singapore |
Keywords: Haptics and Haptic Interfaces, Motion Control, Robust/Adaptive Control
Abstract: This paper proposes a new variable-gain sliding mode filter augmented by variable windowing for achieving a smooth and reactive response over a broad range of input frequencies. The proposed filter can be seen as a synergistic combination of Kikuuwe et al.’s [1] sliding mode filter with varying gain and sliding surfaces and a novel varying-length moving-window algorithm. In all schemes, the estimated input speed is employed to adjust the filter parameters between low and high settings. The discrete-time algorithm of the proposed filter does not suffer from chattering, owing to the implicit (backward) Euler method. The effectiveness of the proposed filter in achieving a better trade-off between noise attenuation and signal preservation is validated in both simulation and experimental scenarios using the velocity signal obtained by differentiation of quantized position data.
|
|
08:40-08:45, Paper WeAT14.3 | |
A Comparative Study between a Virtual Wand and a One-To-One Approach for the Teleoperation of a Nearby Robotic Manipulator |
|
Poignant, Alexis | Sorbonne Université, ISIR UMR 7222 CNRS |
Morel, Guillaume | Sorbonne Université, CNRS, INSERM |
Jarrassé, Nathanael | Sorbonne Université, ISIR UMR 7222 CNRS |
Keywords: Telerobotics and Teleoperation, Physically Assistive Devices
Abstract: The prevailing and most effective approach to teleoperate a robotic arm involves a direct position-to-position mapping, imposing robotic end-effector movements that mirror those of the user. However, due to this one-to-one mapping, the robot's motions are limited by the user's capability, particularly in translation. Drawing inspiration from head pointers utilized in the 1980s, originally designed to enable drawing with limited head motions for tetraplegic individuals, we propose a "virtual wand" mapping that can be used by participants with reduced mobility. This mapping employs a virtual rigid linkage between the hand and the robot's end-effector. With this approach, rotations produce amplified translations through a lever arm, creating a "rotation-to-position" coupling and expanding the translation workspace at the expense of a reduced rotation space. In this study, we compare the virtual wand approach to the one-to-one position mapping through the realization of 6-DoF reaching tasks. Results indicate that the two different mappings perform comparably well, are equally well-received by users, and exhibit similar motor control behaviors. Nevertheless, the virtual wand mapping is anticipated to outperform in tasks characterized by large translations and minimal effector rotations, whereas direct mapping is expected to demonstrate advantages in large rotations with minimal translations. These results pave the way for new interactions and interfaces, particularly in disability assistance utilizing residual body movements (instead of hands) as control input. Leveraging body parts with substantial rotations could enable the accomplishment of tasks previously deemed infeasible with standard direct coupling interfaces.
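As a rough illustration of the two mappings being compared, the sketch below contrasts a one-to-one pose mapping with a virtual-wand mapping, in which hand rotations are converted into amplified end-effector translations through a lever arm. The function names and the wand length value are illustrative assumptions, not values taken from the paper.

    import numpy as np
    from scipy.spatial.transform import Rotation as R

    def one_to_one(hand_pos, hand_rot):
        # Direct mapping: the end-effector mirrors the user's hand pose.
        return hand_pos, hand_rot

    def virtual_wand(hand_pos, hand_rot, wand_length=0.5):
        # Virtual rigid link attached to the hand: rotating the hand sweeps
        # the wand tip, producing amplified end-effector translations.
        tip_offset = hand_rot.apply([0.0, 0.0, wand_length])
        return hand_pos + tip_offset, hand_rot

    hand_pos = np.array([0.10, 0.00, 0.20])
    hand_rot = R.from_euler("xyz", [0.0, 20.0, 0.0], degrees=True)
    print(virtual_wand(hand_pos, hand_rot)[0])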
|
|
08:45-08:50, Paper WeAT14.4 | |
A Novel Telelocomotion Framework with CoM Estimation for Scalable Locomotion on Humanoid Robots |
|
He, An-Chi | Virginia Tech |
Li, Junheng | University of Southern California |
Park, Jungsoo | Virginia Tech |
Kolt, Omar | University of Southern California |
Beiter, Benjamin | Virginia Polytechnic Institute and State University |
Leonessa, Alexander | Virginia Tech |
Nguyen, Quan | University of Southern California |
Akbari Hamed, Kaveh | Virginia Tech |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Humanoid and Bipedal Locomotion
Abstract: Teleoperated humanoid robot systems have made substantial advancements in recent years, offering a physical avatar that harnesses human skills and decision-making while safeguarding users from hazardous environments. However, current telelocomotion interfaces often fail to accurately represent the robot's environment, limiting the user’s ability to effectively navigate the robot through unstructured terrain. This paper presents an initial telelocomotion framework that integrates the ForceBot locomotion interface with the small-sized humanoid robot, HECTOR V2. The framework utilizes ForceBot to simulate walking motion and estimate the user’s Center of Mass (CoM) trajectory, which serves as a tracking reference for the robot. On the robot side, a model predictive control (MPC) approach, based on a reduced-order single rigid body model, is employed to track the user’s scaled trajectory. We present experimental results on ForceBot’s CoM estimation and the robot’s tracking performance, demonstrating the feasibility of this approach.
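For context, the reduced-order single rigid body model that such MPC formulations typically build on (generic form; the paper's exact variant may differ) treats the robot as one body of mass m and rotational inertia I, driven by ground-reaction forces f_i applied at the foot positions r_i:

    m \ddot{p} = \sum_i f_i + m g, \qquad \frac{d}{dt}\,(I \omega) = \sum_i (r_i - p) \times f_i

where p is the CoM position and \omega the body angular velocity; the user's estimated CoM trajectory then enters as the reference that the MPC tracks.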
|
|
08:50-08:55, Paper WeAT14.5 | |
Stiffness Regulation Co-Pilot in Bilateral Teleimpedance Control: A Preliminary User Study |
|
Gomez Hernandez, Pedro | Aarhus University Herning |
Jakobsen, Jonas Mariager | SDU Robotics, the Maersk Mc-Kinney Moller Institute, University |
Pacchierotti, Claudio | Centre National De La Recherche Scientifique (CNRS) |
Chinello, Francesco | Aarhus University |
Fang, Cheng | University of Southern Denmark |
Keywords: Telerobotics and Teleoperation, Haptics and Haptic Interfaces, Physical Human-Robot Interaction
Abstract: Variable stiffness of a remote robot is crucial for a teleoperation system to deal with challenging tasks. External stiffness command interfaces have emerged as a promising solution for regulating the remote robot stiffness because of their accuracy, ergonomics, and avoidance of the "coupling effect" that usually exists in muscle activity-based stiffness interfaces. However, the use of an external stiffness command interface requires good coordination between the operator's two limbs, which simultaneously handle the teleoperation task and the stiffness regulation task, respectively; this is demanding for novice operators in dynamic situations that necessitate agile and timely stiffness adjustments. In this paper, a new concept of a Stiffness Regulation Co-pilot is proposed to facilitate the use of these interfaces. A co-pilot is a virtual agent that consists of a Stiffness Regulation Policy, which infers a reasonable stiffness regulation action from the task performance, and a feedback modality, which conveys the suggested stiffness regulation action to the operator. A preliminary user study was conducted to evaluate the efficacy of the co-pilot and the effect of its different feedback modalities. The results showed that cutaneous feedback, alone or combined with another modality, can potentially improve the task performance of the system and reduce the cognitive load of the operator compared to a teleoperation system that does not use the co-pilot.
|
|
08:55-09:00, Paper WeAT14.6 | |
Adaptive Neural Network Synchronous Tracking Control for Teleoperation Robots under Event-Triggered Mechanism |
|
Wang, Fujie | Dongguan University of Technology |
Yu, Yuanjia | Shenzhen University |
Li, Xing | School of Electrical Engineering & Intelligentization, Dongguan |
Luo, Junxuan | Dongguan University of Technology |
Zhong, Jinming | Shenzhen University |
Keywords: Motion Control, Human-Robot Collaboration, Grippers and Other End-Effectors
Abstract: This paper proposes an adaptive neural network synchronous tracking control strategy under an event-triggered mechanism to address modeling uncertainties and communication delays in bilateral teleoperation systems. By introducing the event-triggered mechanism to reduce the network communication frequency in the teleoperation system, the master and slave robots communicate with each other only when the triggering conditions are fulfilled, which enhances the efficiency of the network communication. This control strategy guarantees the exponential convergence of the position synchronization tracking error of the master-slave robot end-effectors. Moreover, the event-triggered conditions do not require any empirical design but can be derived inversely with the aid of Lyapunov stability theory, and the triggering time interval between two neighboring events is verified to be non-zero. It is further demonstrated using the Lyapunov principle that the presented adaptive neural network control strategy ensures asymptotic and exponential convergence of the position synchronization tracking error for the master-slave robots under the designed event-triggered mechanism. Finally, the feasibility and effectiveness of the developed control strategy are validated through comparative case studies.
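A minimal sketch of the kind of event-triggered transmission logic described above (illustrative only: the threshold rule and names here are hypothetical placeholders, whereas the paper derives its triggering condition from the Lyapunov analysis):

    import numpy as np

    def should_transmit(x_current, x_last_sent, sigma=0.05, eps=1e-3):
        # Transmit only when the deviation from the last transmitted state
        # exceeds a state-dependent threshold, reducing network traffic.
        error = np.linalg.norm(x_current - x_last_sent)
        return error > sigma * np.linalg.norm(x_current) + eps

    x_last_sent = np.zeros(3)
    for t in range(200):
        x = np.array([np.sin(0.05 * t), np.cos(0.05 * t), 0.01 * t])
        if should_transmit(x, x_last_sent):
            x_last_sent = x  # the master state is sent to the slave side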
|
|
WeAT15 |
403 |
Bimanual Manipulation 1 |
Regular Session |
|
08:30-08:35, Paper WeAT15.1 | |
Learning Visuotactile Skills with Two Multifingered Hands |
|
Lin, Toru | University of California, Berkeley |
Zhang, Yu | University of California Berkeley |
Li, Qiyang | University of California, Berkeley |
Qi, Haozhi | UC Berkeley |
Yi, Brent | University of California, Berkeley |
Levine, Sergey | UC Berkeley |
Malik, Jitendra | UC Berkeley |
Keywords: Bimanual Manipulation, Dexterous Manipulation, Learning from Demonstration
Abstract: Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found on: https://toruowo.github.io/hato
|
|
08:35-08:40, Paper WeAT15.2 | |
Learning Coordinated Bimanual Manipulation Policies Using State Diffusion and Inverse Dynamics Models |
|
Chen, Haonan | University of Illinois at Urbana-Champaign |
Xu, Jiaming | University of Illinois Urbana-Champaign |
Sheng, Lily | Tsinghua University |
Ji, Tianchen | University of Illinois at Urbana-Champaign |
Liu, Shuijing | The University of Texas at Austin |
Li, Yunzhu | Columbia University |
Driggs-Campbell, Katherine | University of Illinois at Urbana-Champaign |
Keywords: AI-Based Methods, Bimanual Manipulation, Imitation Learning
Abstract: When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the predictive nature of human manipulation strategies into robot imitation learning. Specifically, we disentangle task-related state transitions from agent-specific inverse dynamics modeling to enable effective bimanual coordination. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, envisioning how the scene evolves. Then, we use an inverse dynamics model to compute robot actions that achieve the predicted states. Our key insight is that modeling object movement helps in learning policies for coordinated bimanual manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object setups, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, maintain stability across different control modes, and synthesize a broader range of behaviors than those present in the demonstration dataset.
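The decomposition described above can be summarized schematically as follows; this is not the authors' architecture (the real state predictor is a diffusion model, and the dimensions and layer sizes below are placeholders):

    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM = 16, 14  # placeholder dimensions

    # Stand-in for the learned state-transition (diffusion) model.
    state_predictor = nn.Sequential(
        nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, STATE_DIM))

    # Agent-specific inverse dynamics: (s_t, s_{t+1}) -> a_t.
    inverse_dynamics = nn.Sequential(
        nn.Linear(2 * STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM))

    s_t = torch.randn(1, STATE_DIM)
    s_next = state_predictor(s_t)                              # envision the next state
    a_t = inverse_dynamics(torch.cat([s_t, s_next], dim=-1))   # act to reach it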
|
|
08:40-08:45, Paper WeAT15.3 | |
BiFold: Bimanual Cloth Folding with Language Guidance |
|
Barbany, Oriol | IRI (CSIC-UPC) |
Colomé, Adrià | Institut De Robòtica I Informàtica Industrial (CSIC-UPC), Q28180 |
Torras, Carme | Csic - Upc |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Data Sets for Robot Learning
Abstract: Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their complicated dynamics, and the disparate materials, geometries, and textures that garments can have. In this work, we learn folding actions conditioned on text commands. Translating high-level, abstract instructions into precise robotic actions requires sophisticated language understanding and manipulation capabilities. To do that, we leverage a pre-trained vision-language model and repurpose it to predict manipulation actions. Our model, BiFold, can take context into account and achieves state-of-the-art performance on an existing language-conditioned folding benchmark. To address the lack of annotated bimanual folding data, we introduce a novel dataset with automatically parsed actions and language-aligned instructions, enabling better learning of text-conditioned manipulation. BiFold attains the best performance on our dataset and demonstrates strong generalization to new instructions, garments, and environments.
|
|
08:45-08:50, Paper WeAT15.4 | |
One-Shot Dual-Arm Imitation Learning |
|
Wang, Yilong | Imperial College London |
Johns, Edward | Imperial College London |
Keywords: Dual Arm Manipulation, Imitation Learning, Visual Servoing
Abstract: We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at https://www.robot-learning.uk/one-shot-dual-arm.
|
|
08:50-08:55, Paper WeAT15.5 | |
In the Wild Ungraspable Object Picking with Bimanual Nonprehensile Manipulation |
|
Wu, Albert | Stanford University |
Kruse, Daniel | Rensselaer Polytechnic Institute |
Keywords: Dual Arm Manipulation, Mobile Manipulation, Grasping
Abstract: Picking diverse objects in the real world is a fundamental robotics skill. However, many objects in such settings are bulky, heavy, or irregularly shaped, making them ungraspable by conventional end effectors like suction grippers and parallel jaw grippers (PJGs). In this paper, we expand the range of pickable items without hardware modifications using bimanual nonprehensile manipulation. We focus on a grocery shopping scenario, where a bimanual mobile manipulator equipped with a suction gripper and a PJG is tasked with retrieving ungraspable items from tightly packed grocery shelves. From visual observations, our method first identifies optimal grasp points based on force closure and friction constraints. If the grasp points are occluded, a series of nonprehensile nudging motions are performed to clear the obstruction. A bimanual grasp utilizing contacts on the side of the end effectors is then executed to grasp the target item. In our replica grocery store, we achieved a 90% success rate over 102 trials in uncluttered scenes, and a 66% success rate over 45 trials in cluttered scenes. We also deployed our system to a real-world grocery store and successfully picked previously unseen items. Our results highlight the potential of bimanual nonprehensile manipulation for in-the-wild robotic picking tasks. A video summarizing this work can be found at youtu.be/g0hOrDuK8jM
|
|
08:55-09:00, Paper WeAT15.6 | |
Bimanual Grasp Synthesis for Dexterous Robot Hands |
|
Shao, Yanming | ShanghaiTech University |
Xiao, Chenxi | ShanghaiTech University |
Keywords: Bimanual Manipulation, Grasping, Dexterous Manipulation
Abstract: Humans naturally perform bimanual skills to handle large and heavy objects. To enhance a robot's object manipulation capabilities, generating effective bimanual grasp poses is essential. Nevertheless, bimanual grasp synthesis for dexterous hand manipulators remains underexplored. To bridge this gap, we propose the BimanGrasp algorithm for synthesizing bimanual grasps on 3D objects. The BimanGrasp algorithm generates grasp poses by optimizing an energy function that considers grasp stability and feasibility. Furthermore, the quality of the synthesized grasps is verified using the Isaac Gym physics simulation engine. These verified grasp poses form the BimanGrasp-Dataset, which is, to our knowledge, the first synthesized bimanual dexterous hand grasp pose dataset. The dataset comprises over 150k verified grasps on 900 objects, facilitating the synthesis of bimanual grasps through a data-driven approach. Lastly, we propose a diffusion model (BimanGrasp-DDPM) trained on the BimanGrasp-Dataset. This model achieved a grasp synthesis success rate of 69.87% and a significant acceleration in computational speed compared to the BimanGrasp algorithm.
|
|
WeAT16 |
404 |
Grasping 1 |
Regular Session |
Chair: Kasaei, Hamidreza | University of Groningen |
Co-Chair: Thondiyath, Asokan | IIT Madras |
|
08:30-08:35, Paper WeAT16.1 | |
Efficient 7-DoF Grasp for Target-Driven Object in Dense Cluttered Scenes |
|
Lei, Tianjiao | Chongqing University |
Sun, Yizhuo | Harbin Institute of Technology |
Huang, Yi | Chongqing University |
Huang, Jiangshuai | Nanyang Technological University |
Jiang, Tao | Chongqing University |
Keywords: Grasping, Perception for Grasping and Manipulation, Cyborgs
Abstract: Achieving a real-time precise grasp of a specified target object in densely cluttered environments is an essential capability for autonomous robot operation. Recently, considerable investigations on planar and spatial grasp have been carried out, and significant results have been obtained. However, these point cloud-based grasp prediction methods often fail to ensure that the generated grasp configurations meet the precise requirements of the task. Additionally, some of the existing grasp pipelines are too time-consuming to meet the demand for real-time robot response. In more challenging cluttered scenes, the quality of pose and gripper jaw opening estimation in high-dimensional space requires further improvement. Therefore, this paper introduces a data- and model-independent and efficient method to generate 7-DoF grasp configurations for arbitrary target objects from single-view point cloud data in dense cluttered scenes. In addition, this paper proposes a grasp framework that generates the grasp configuration for the target object while reducing the time consumed during the grasp process, to enable robots to efficiently grasp target objects for designated tasks. The grasp pipeline focuses on guided regions via target detection and rapidly adjusts grasp configurations through multi-region point cloud distribution perception. Extensive real-world robot experiments have demonstrated the effectiveness of the proposed method in grasping target objects in cluttered scenes, achieving higher success rates and reduced runtime compared to baseline methods. The realized code and video are available at https://github.com/L-tj/7DGCG.
|
|
08:35-08:40, Paper WeAT16.2 | |
Task-Oriented 6-DoF Grasp Pose Detection in Clutters |
|
Wang, An-Lan | Sun Yat-Sen University |
Chen, Nuo | Sun Yat-Sen University |
Lin, Kun-Yu | Sun Yat-Sen University |
Li, Yuan-Ming | Sun Yat-Sen University |
Zheng, Wei-Shi | Sun Yat-Sen University |
Keywords: Grasping
Abstract: In general, humans would grasp an object differently for different tasks, e.g., "grasping the handle of a knife to cut" vs. "grasping the blade to hand over". In the field of robotic grasp pose detection research, some existing works have considered this task-oriented grasping and made some progress, but they are generally constrained to low-DoF gripper types or non-cluttered settings, which are not applicable to human assistance in real life. With the aim of obtaining more general and practical grasp models, in this paper we investigate a new problem named Task-Oriented 6-DoF Grasp Pose Detection in Clutters (TO6DGC), which extends the task-oriented problem to the more general setting of 6-DoF grasp pose detection in cluttered (multi-object) scenarios. To this end, we construct a large-scale 6-DoF task-oriented grasping dataset, 6-DoF Task Grasp (6DTG), which features 4391 cluttered scenes with over 2 million 6-DoF grasp poses. Each grasp is annotated with a specific task, involving 6 tasks and 198 objects in total. Moreover, we propose One-Stage TaskGrasp (OSTG), a strong baseline to address the TO6DGC problem. Our OSTG adopts a task-oriented point selection strategy to detect where to grasp, and a task-oriented grasp generation module to decide how to grasp given a specific task. To evaluate the effectiveness of OSTG, extensive experiments are conducted on 6DTG. The results show that our method outperforms various baselines on multiple metrics. Real robot experiments also verify that our OSTG has a better perception of the task-oriented grasp points and 6-DoF grasp poses.
|
|
08:40-08:45, Paper WeAT16.3 | |
QuickGrasp: Lightweight Antipodal Grasp Planning with Point Clouds |
|
Ravie, Navin Sriram | Indian Institute of Technology Madras |
Murugan, Keerthi Vasan | Indian Institute of Technology Madras |
Thondiyath, Asokan | IIT Madras |
Sebastian, Bijo | IIT Madras |
Keywords: Grasping, Manipulation Planning, Perception for Grasping and Manipulation
Abstract: Grasping has been a long-standing challenge in facilitating the final interface between a robot and the environment. As environments and tasks become complicated, the need to embed higher intelligence to infer from the surroundings and act on them has become necessary. Although most methods estimate the grasp pose by treating the problem either via pure sampling-based approaches in the six-degree-of-freedom space or as a learning problem, they usually fail in real-life settings owing to poor generalization across domains. In addition, the time taken to generate the grasp plan and the lack of repeatability, owing to sampling inefficiency and the probabilistic nature of existing grasp planning approaches, severely limit their application in real-world tasks. This paper presents a lightweight analytical approach towards robotic grasp planning, particularly antipodal grasps, with little to no sampling in the six-degree-of-freedom space. The proposed grasp planning algorithm is formulated as an optimization problem towards estimating grasp points on the object surface instead of directly estimating the end-effector pose. To this extent, a soft-region-growing algorithm is presented for effective plane segmentation, even in the case of curved surfaces. An optimization-based quality metric is then used for the evaluation of grasp points to ensure indirect force closure. The proposed grasp framework is compared with the existing state-of-the-art grasp planning approach, Grasp Pose Detection (GPD), as a baseline over multiple simulated objects. The effectiveness of the proposed approach in comparison to GPD is also evaluated in a real-world setting using image and point-cloud data, with the planned grasps being executed using a ROBOTIQ gripper and a UR5 manipulator. The proposed approach shows better performance in terms of a higher probability of force closure with complete repeatability.
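A minimal numerical sketch of the antipodal condition underlying this style of grasp planning (a generic friction-cone test on a candidate contact pair; the paper's actual quality metric and region-growing segmentation are not reproduced here):

    import numpy as np

    def is_antipodal(p1, n1, p2, n2, mu=0.4):
        # n1, n2: inward-pointing unit surface normals at the two contacts.
        # The grasp axis must lie inside both friction cones (half-angle atan(mu)).
        axis = p2 - p1
        axis = axis / np.linalg.norm(axis)
        half_angle = np.arctan(mu)
        ok1 = np.arccos(np.clip(np.dot(n1, axis), -1.0, 1.0)) <= half_angle
        ok2 = np.arccos(np.clip(np.dot(n2, -axis), -1.0, 1.0)) <= half_angle
        return ok1 and ok2

    print(is_antipodal(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                       np.array([0.05, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])))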
|
|
08:45-08:50, Paper WeAT16.4 | |
Behavioral Manifolds: Representing the Landscape of Grasp Affordances in Relative Pose Space |
|
Zechmair, Michael | Maastricht University |
Morel, Yannick | Maastricht University |
Keywords: Grasping, Grippers and Other End-Effectors, Manipulation Planning
Abstract: The use of machine learning to investigate grasp affordances has received extensive attention over the past several decades. The existing literature provides a robust basis to build upon, though a number of aspects may be improved. Results commonly work in terms of grasp configuration, with little consideration for the manner in which the grasp may be (re-)produced, from a reachability and trajectory planning perspective. We propose a different perspective on grasp affordance learning, explicitly accounting for grasp synthesis; that is, the manner in which manipulator kinematics are used to allow materialization of grasps. The approach allows the grasp policy space to be explicitly mapped in terms of generated grasp types and associated grasp quality. Results of application to a range of objects illustrate the merit of the method and highlight the manner in which it may promote a greater degree of explainability for otherwise opaque reinforcement processes.
|
|
08:50-08:55, Paper WeAT16.5 | |
NeRF-Based Transparent Object Grasping Enhanced by Shape Priors |
|
Han, Yi | Shenzhen Technology University |
Lin, Zixin | Shenzhen Technology University |
Li, DongJie | Shenzhen Technology University |
Chen, Lvping | Shenzhen Technology University |
Shi, Yongliang | Tsinghua University |
Ma, Gan | Shenzhen Technology University |
Keywords: Grasping
Abstract: Transparent object grasping remains a persistent challenge in robotics, largely due to the difficulty of acquiring precise 3D information. Conventional optical 3D sensors struggle to capture transparent objects, and machine learning methods are often hindered by their reliance on high-quality datasets. Leveraging NeRF’s capability for continuous spatial opacity modeling, our proposed architecture integrates a NeRF-based approach for reconstructing the 3D information of transparent objects. Despite this, certain portions of the reconstructed 3D information may remain incomplete. To address these deficiencies, we introduce a shape-prior-driven completion mechanism, further refined by a geometric pose estimation method we have developed. This allows us to obtain complete and reliable 3D information about transparent objects. Utilizing this refined data, we perform scene-level grasp prediction and deploy the results in real-world robotic systems. Experimental validation demonstrates the efficacy of our architecture, showcasing its capability to reliably capture 3D information of various transparent objects in cluttered scenes and, correspondingly, to achieve high-quality, stable, and executable grasp predictions.
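For reference, the continuous opacity model the abstract alludes to is the standard NeRF volume-rendering formulation (well-known background rather than a contribution of this paper): a ray is rendered by compositing per-sample densities \sigma_i and colors c_i over segment lengths \delta_i,

    \hat{C}(\mathbf{r}) = \sum_i T_i \,\bigl(1 - e^{-\sigma_i \delta_i}\bigr)\, \mathbf{c}_i, \qquad T_i = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr),

and it is this per-point density field that supplies a usable geometric signal for transparent surfaces where conventional depth sensors fail.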
|
|
08:55-09:00, Paper WeAT16.6 | |
Center Direction Network for Grasping Point Localization on Cloths |
|
Tabernik, Domen | University of Ljubljana |
Muhovič, Jon | Faculty of Electrical Engineering, University of Ljubljana |
Urbas, Matej | University of Ljubljana, Faculty of Computer and Information Sci |
Skocaj, Danijel | University of Ljubljana |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, RGB-D Perception
Abstract: Object grasping is a fundamental challenge in robotics and computer vision, critical for advancing robotic manipulation capabilities. Deformable objects, like fabrics and cloths, pose additional challenges due to their non-rigid nature. In this work, we introduce CeDiRNet-3DoF, a deep-learning model for grasp point detection, with a particular focus on cloth objects. CeDiRNet-3DoF employs center direction regression alongside a localization network, attaining first place in the perception task of ICRA 2023's Cloth Manipulation Challenge. Recognizing the lack of standardized benchmarks in the literature that hinder effective method comparison, we present the ViCoS Towel Dataset. This extensive benchmark dataset comprises 8,000 real and 12,000 synthetic images, serving as a robust resource for training and evaluating contemporary data-driven deep-learning approaches. Extensive evaluation revealed CeDiRNet-3DoF's robustness in real-world performance, outperforming state-of-the-art methods, including the latest transformer-based models. Our work bridges a crucial gap, offering a robust solution and benchmark for cloth grasping in computer vision and robotics.
|
|
WeAT17 |
405 |
Localization 3 |
Regular Session |
Chair: Joerger, Mathieu | Virginia Tech |
Co-Chair: Halperin, Dan | Tel Aviv University |
|
08:30-08:35, Paper WeAT17.1 | |
How Safe Is Particle Filtering-Based Localization for Mobile Robots? An Integrity Monitoring Approach |
|
Abdul Hafez, Osama | American University of Madaba |
Joerger, Mathieu | Virginia Tech |
Spenko, Matthew | Illinois Institute of Technology |
Keywords: Localization, Probability and Statistical Methods, Robot Safety, Autonomous Vehicle Navigation
Abstract: Deriving safe bounds on particle filter estimates is a research problem that, if solved, could greatly benefit robots in life-critical applications, a field that is facing increasing interest as more robots are being deployed near humans. In response, this paper introduces a new fault detector and derives a performance measure for the particle filter: integrity risk. Integrity risk is defined as the probability of having large estimate errors without triggering an alarm, all while considering measurement faults, i.e., unknown deterministic errors that cannot be modeled via normal white noise. In this work, the faults come in the form of incorrectly associated features when using the local nearest neighbors. Simulations and experiments assess the efficiency of the introduced safety metric. The results show that safety improves as map density increases, as long as the number of particles is sufficient to shape the error distribution and the landmarks are well separated. Also, the results indicate that, when landmarks are poorly separated, the particle filter is safer than the Kalman filter, whereas, when landmarks are well separated, the particle filter is often, but not always, safer than the Kalman filter.
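In its generic navigation-integrity form (the paper's precise definition may differ in its conditioning on fault hypotheses), the integrity risk discussed above is the joint probability of a hazardously large estimate error and a missed detection,

    IR = P\bigl( \lVert \hat{x} - x \rVert > \ell \ \wedge\ q \le T \bigr),

where \ell is the alert limit on the estimate error and q \le T means the fault detector's test statistic stays below its alarm threshold.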
|
|
08:35-08:40, Paper WeAT17.2 | |
Lighthouse Localization of Miniature Wireless Robots |
|
Alvarado-Marin, Said | INRIA |
Huidobro-Marin, Cristobal | INRIA |
Balbi, Martina | INRIA |
Savic, Trifun | INRIA |
Watteyne, Thomas | Inria |
Maksimovic, Filip | INRIA |
Keywords: Localization, Multi-Robot Systems, Wheeled Robots
Abstract: In this paper, we apply lighthouse localization, originally designed for virtual reality motion tracking, to positioning and localization of indoor robots. We first present a lighthouse decoding and tracking algorithm on a low-power wireless microcontroller with hardware implemented in a cm-scale form factor. One-time scene solving is performed on a computer using a variety of standard computer vision techniques. Three different robotic localization scenarios are analyzed in this work. The first is a planar scene with a single lighthouse with a four-point pre-calibration. The second is a planar scene with two lighthouses that self-calibrates with either multiple robots in the experiment or a single robot in motion. The third extends to a 3D scene with two lighthouses and a self-calibration algorithm. The absolute accuracy, measured against a camera-based tracking system, was found to be 7.25 mm RMS for the 2D case and 11.2 mm RMS for the 3D case, respectively. This demonstrates the viability of lighthouse tracking both for small-scale robotics and as an inexpensive and compact alternative to camera-based setups.
|
|
08:40-08:45, Paper WeAT17.3 | |
EVLoc: Event-Based Visual Localization in LiDAR Maps Via Event-Depth Registration |
|
Chen, Kuangyi | Graz University of Technology |
Zhang, Jun | Graz University of Technology |
Fraundorfer, Friedrich | Graz University of Technology |
Keywords: Localization, Deep Learning for Visual Perception
Abstract: Event cameras are bioinspired sensors with some notable features, including high dynamic range and low latency, which makes them exceptionally suitable for perception in challenging scenarios such as high-speed motion and extreme lighting conditions. In this paper, we explore their potential for localization within pre-existing LiDAR maps, a critical task for applications that require precise navigation and mobile manipulation. Our framework follows a paradigm based on the refinement of an initial pose. Specifically, we first project LiDAR points into 2D space based on a rough initial pose to obtain depth maps, and then employ an optical flow estimation network to align events with LiDAR points in 2D space, followed by camera pose estimation using a PnP solver. To enhance geometric consistency between these two inherently different modalities, we develop a novel frame-based event representation that improves structural clarity. Additionally, given the varying degrees of bias observed in the ground truth poses, we design a module that predicts an auxiliary variable as a regularization term to mitigate the impact of this bias on network convergence. Experimental results on several public datasets demonstrate the effectiveness of our proposed method. To facilitate future research, both the code and the pre-trained models are made available online.
|
|
08:45-08:50, Paper WeAT17.4 | |
MambaGlue: Fast and Robust Local Feature Matching with Mamba |
|
Ryoo, Kihwan | Korea Advanced Institute of Science and Technology |
Lim, Hyungtae | Massachusetts Institute of Technology |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Localization, Deep Learning for Visual Perception, Recognition
Abstract: In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on https://github.com/url-kaist/MambaGlue.
|
|
08:50-08:55, Paper WeAT17.5 | |
ULOC: Learning to Localize in Complex Large-Scale Environments with Ultra-Wideband Ranges |
|
Nguyen, Thien-Minh | Nanyang Technological University |
Yang, Yizhuo | Nanyang Technological University |
Nguyen, Tien-Dat | Ho Chi Minh City University of Technology (HCMUT), VNU-HCM |
Yuan, Shenghai | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Localization, Range Sensing, Autonomous Vehicle Navigation
Abstract: While UWB-based methods can achieve high localization accuracy in small-scale areas, their accuracy and reliability are significantly challenged in large-scale environments. In this paper, we propose a learning-based framework for Ultra-Wideband (UWB) based localization in such complex large-scale environments, named ULOC. First, anchors are deployed in the environment without knowledge of their actual position. Then, UWB observations are collected when the vehicle travels in the environment. At the same time, map-consistent pose estimates are developed from registering (onboard self-localization) data with the prior map to provide the training labels. We then propose a recurrent neural network (RNN) based on MAMBA that learns the ranging patterns of UWBs over a complex large-scale environment. The experiment demonstrates that our solution can ensure high localization accuracy on a large scale compared to the state-of-the-art. We release our source code to benefit the community at https://github.com/brytsknguyen/uloc.
|
|
08:55-09:00, Paper WeAT17.6 | |
Indoor Localization of UAVs Using Only Few Measurements by Output-Sensitive Preimage Intersection |
|
Bilevich, Michael M. | Tel Aviv University |
Buber, Tomer | Tel Aviv University |
Halperin, Dan | Tel Aviv University |
Keywords: Localization
Abstract: We present a deterministic approach for the localization of an Unmanned Aerial Vehicle (UAV) in a known indoor environment by using only a few downward distance measurements and the corresponding odometries between measurements. For each distance measurement and odometry, we look at the preimage of that distance measurement under the downward distance function combined with the corresponding odometry, where the motion between every two measurements has four degrees of freedom: three of translation and one of azimuth change. The intersection of these preimages yields the set of all possible locations for the UAV. In this work, we present an efficient method for approximating that intersection of preimages. We perform a spatial subdivision search, which splits only voxels containing that intersection. We present a novel technique, based on geometric insights, for correctly evaluating whether a voxel indeed contains a true localization. This technique is also robust under different kinds of errors that might occur. Our method is guaranteed to return a set that contains the ground-truth location, and its runtime complexity is output-sensitive in the Hausdorff dimension and measure of the resulting intersection of preimages. We demonstrate the effectiveness of this method in various indoor scenarios, showing that it can be used to significantly decrease the uncertainty of localization when solving the kidnapped robot problem in simulation and on a physical drone. Our method can be performed in real time. Furthermore, our method requires only a map of the environment, odometry, and ToF sensors, which is advantageous in terms of cost, privacy, and transmission bandwidth. Our open-source software and supplementary materials are available at https://github.com/TAU-CGL/uav-fdml-public.
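A much-simplified sketch of the preimage-intersection idea (translation-only, a uniform grid instead of the paper's output-sensitive spatial subdivision, and a made-up toy height map and tolerance):

    import numpy as np

    def floor_height(x, y):
        # Toy indoor map: a 0.5 m-tall box occupying x, y in [2, 3] x [2, 3].
        return 0.5 if (2.0 <= x <= 3.0 and 2.0 <= y <= 3.0) else 0.0

    def consistent_cells(measurements, grid, tol=0.05):
        # measurements: list of ((dx, dy, dz) odometry offset, downward range d).
        # Keep candidate start poses whose implied downward range matches
        # every measurement, i.e. cells lying in all preimages.
        keep = []
        for (x, y, z) in grid:
            ok = all(abs((z + dz) - floor_height(x + dx, y + dy) - d) < tol
                     for (dx, dy, dz), d in measurements)
            if ok:
                keep.append((x, y, z))
        return keep

    grid = [(x, y, z) for x in np.arange(0.0, 5.0, 0.25)
                      for y in np.arange(0.0, 5.0, 0.25)
                      for z in np.arange(0.5, 2.5, 0.25)]
    meas = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 1.0), ((2.5, 2.5, 0.0), 0.5)]
    print(len(consistent_cells(meas, grid)))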
|
|
WeAT18 |
406 |
Software Tools 1 |
Regular Session |
|
08:30-08:35, Paper WeAT18.1 | |
Motion Comparator: Visual Comparison of Robot Motions |
|
Wang, Yeping | University of Wisconsin-Madison |
Peseckis, Alexander | University of Wisconsin -- Madison |
Jiang, Zelong | University of Wisconsin-Madison |
Gleicher, Michael | University of Wisconsin - Madison |
Keywords: Software Tools for Robot Programming, Software Tools for Benchmarking and Reproducibility
Abstract: Roboticists compare robot motions for tasks such as parameter tuning, troubleshooting, and deciding between possible motions. However, most existing visualization tools are designed for individual motions and lack the features necessary to facilitate robot motion comparison. In this paper, we follow a rigorous design process to create Motion Comparator, a web-based tool that facilitates the comprehension, comparison, and communication of robot motions. Our design process identified roboticists' needs, articulated design challenges, and provided corresponding strategies. Motion Comparator includes several key features such as multi-view coordination, quaternion visualization, time warping, and comparative designs. To demonstrate the applications of Motion Comparator, we discuss four case studies in which our tool is used for motion selection, troubleshooting, parameter tuning, and motion review.
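The time-warping feature mentioned above is, in spirit, the classic dynamic time warping alignment between two trajectories of different lengths; a minimal sketch of that underlying computation (illustrative, and not necessarily the tool's exact implementation):

    import numpy as np

    def dtw_distance(traj_a, traj_b):
        # traj_a: (N, D), traj_b: (M, D) arrays of joint configurations.
        n, m = len(traj_a), len(traj_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                     cost[i - 1, j - 1])
        return cost[n, m]

    a = np.random.rand(50, 7)   # two 7-DoF motions of different durations
    b = np.random.rand(80, 7)
    print(dtw_distance(a, b))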
|
|
08:35-08:40, Paper WeAT18.2 | |
Text2Robot: Evolutionary Robot Design from Text Descriptions |
|
Chen, Boyuan | Duke University |
Charlick, Zachary Samuel | Duke University |
Ringel, Ryan | Duke University |
Liu, Jiaxun | Duke University |
Xia, Boxi | Duke University |
Keywords: Methods and Tools for Robot System Design, Evolutionary Robotics
Abstract: Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.
|
|
08:40-08:45, Paper WeAT18.3 | |
QueryCAD: Grounded Question Answering for CAD Models |
|
Kienle, Claudius | ArtiMinds Robotics GmbH |
Alt, Benjamin | ArtiMinds Robotics |
Katic, Darko | HFT STUTTGART |
Jäkel, Rainer | Karlsruhe Institute of Technology |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Deep Learning Methods, Engineering for Robotic Systems, Software Tools for Robot Programming
Abstract: CAD models are widely used in industry and are essential for robotic automation processes. However, these models are rarely considered in novel AI-based approaches, such as the automatic synthesis of robot programs, as there are no readily available methods that would allow CAD models to be incorporated for the analysis, interpretation, or extraction of information. To address these limitations, we propose QueryCAD, the first system designed for CAD question answering, enabling the extraction of precise information from CAD models using natural language queries. QueryCAD incorporates SegCAD, an open-vocabulary instance segmentation model we developed to identify and select specific parts of the CAD model based on part descriptions. We further propose a CAD question answering benchmark to evaluate QueryCAD and establish a foundation for future research. Lastly, we integrate QueryCAD within an automatic robot program synthesis framework, validating its ability to enhance deep-learning solutions for robotics by enabling them to process CAD models.
|
|
08:45-08:50, Paper WeAT18.4 | |
HeRo: A State Machine-Based, Fault-Tolerant Framework for Heterogeneous Multi-Robot Collaboration |
|
Tang, Ruijie | Institute of Software, Chinese Academy of Sciences |
Wu, Guoquan | Institute of Software, Chinese Academy of Sciences |
Wang, Tao | Institute of Software, Chinese Academy of Sciences |
Chen, Wei | Institute of Software, Chinese Academy of Sciences |
Wei, Jun | Institute of Software, Chinese Academy of Sciences |
Keywords: Software Tools for Robot Programming, Software, Middleware and Programming Environments, Multi-Robot Systems
Abstract: Heterogeneous robots can work together to accomplish a variety of complex tasks and have shown great potential in many fields. There are many efforts to make robot task orchestration more efficient. However, current methods still have some limitations, including the lack of a high-level programming abstraction and of a fault-handling mechanism. In this paper, we design a state machine-based, fault-tolerant framework for heterogeneous multi-robot collaboration named HeRo, to effectively support the development of heterogeneous multi-robot systems. HeRo has three key techniques: (1) a state machine-based programming language to flexibly model robot behaviors and tasks; (2) a state synchronization mechanism to achieve information exchange and maintain consistency among heterogeneous robots in distributed environments; (3) a fault detection and recovery mechanism to monitor the system's runtime states and use a Large Language Model (LLM) combined with the Planning Domain Definition Language (PDDL) to enable automated recovery. We evaluate the effectiveness and fault recovery capability of the framework by setting up manufacturing tasks and fault scenarios of varying difficulty in the ARIAC simulation environment, achieving a 100% task completion rate, with low system overhead and flexible scalability.
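An illustrative fragment of what a state machine-based task description with a fault-recovery state might look like; this is a hypothetical sketch, not HeRo's actual programming language, but it conveys the programming model described above:

    # States and transitions for one robot in a pick-and-place cell.
    task = {
        "initial": "IDLE",
        "states": ["IDLE", "PICK", "PLACE", "FAULT_RECOVERY"],
        "transitions": [
            {"from": "IDLE",  "to": "PICK",  "on": "part_available"},
            {"from": "PICK",  "to": "PLACE", "on": "grasp_success"},
            {"from": "PLACE", "to": "IDLE",  "on": "place_done"},
            # Any state may fall back to recovery when a fault is detected.
            {"from": "*", "to": "FAULT_RECOVERY", "on": "fault_detected"},
            {"from": "FAULT_RECOVERY", "to": "IDLE", "on": "recovery_done"},
        ],
    }

    def next_state(current, event, machine=task):
        for t in machine["transitions"]:
            if t["on"] == event and t["from"] in (current, "*"):
                return t["to"]
        return current  # no matching transition: stay in the current state

    print(next_state("PICK", "fault_detected"))  # -> FAULT_RECOVERY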
|
|
08:50-08:55, Paper WeAT18.5 | |
A Kinematics Optimization Framework with Improved Computational Efficiency for Task-Based Optimum Design of Serial Manipulators in Cluttered Environments |
|
Petkov, Nikola | United Kingdom Atomic Energy Authority |
Tokatli, Ozan | UKAEA |
Zhang, Kaiqiang | UK Atomic Energy Authority |
Wu, Huapeng | Lappeenranta University of Technology |
Skilton, Robert Mark | UK Atomic Energy Authority |
Keywords: Methods and Tools for Robot System Design, Engineering for Robotic Systems, Optimization and Optimal Control
Abstract: It is challenging to find optimum kinematic designs for non-standard robotic manipulators, e.g., medical, nuclear, and space manipulators, which must adapt to arbitrarily complex tasks under constraints. Such design optimization can be modelled as a multi-dimensional non-convex optimization problem with nonlinear constraints. However, it is non-trivial to ensure the essential reachability condition, i.e., the existence of continuous trajectories between demanded positions for serial articulated manipulators, given complex spatial constraints, like obstacles and boundaries. Traditional solutions integrate standard motion planning or inverse kinematics algorithms within a kinematic-design optimization process, resulting in significant demand for time and computing resources. To accelerate design optimization and improve efficiency, we design a novel robust design framework built on a new kinematic design synthesis, which allows for simultaneously optimizing the dimensions and topology of a serial manipulator's kinematics for arbitrary tasks in constrained environments, using a generalised parametric kinematic model. Significantly, in contrast to standard solutions, we develop a novel computationally effective reachability verification method, which rapidly aborts infeasible motions by exploiting efficient collision checks, based on the Rapidly-exploring Random Tree (RRT) algorithm. The effectiveness of the proposed design framework is verified and evaluated by comparing it to baseline benchmarks. Results demonstrate that the novel design framework can accelerate kinematic design optimization by an order of magnitude compared to the current state of the art, and can simultaneously optimise the link dimensions and joint types of serial robots for cluttered environments.
|
|
08:55-09:00, Paper WeAT18.6 | |
A Survey on Small-Scale Testbeds for Connected and Automated Vehicles and Robot Swarms |
|
Mokhtarian, Armin | RWTH Aachen University |
Xu, Jianye | Chair of Embedded Software (Informatik 11), RWTH Aachen Universi |
Scheffe, Patrick | RWTH Aachen University |
Kloock, Maximilian | RWTH Aachen University |
Schäfer, Simon | RWTH Aachen University |
Bang, Heeseung | University of Delaware |
Le, Viet-Anh | University of Delaware |
Ulhas, Sangeet | Arizona State University |
Betz, Johannes | Technical University of Munich |
Wilson, Sean | Georgia Institute of Technology, Georgia Tech Research Institute |
Berman, Spring | Arizona State University |
Paull, Liam | Université De Montréal |
Prorok, Amanda | University of Cambridge |
Alrifaee, Bassam | University of the Bundeswehr Munich |
Keywords: Embedded Systems for Robotic and Automation, Engineering for Robotic Systems, Methods and Tools for Robot System Design
Abstract: Connected and automated vehicles and robot swarms hold transformative potential for enhancing safety, efficiency, and sustainability in the transportation and manufacturing sectors. Extensive testing and validation of these technologies are crucial for their deployment in the real world. While simulations are essential for initial testing, they often have limitations in capturing the complex dynamics of real-world interactions. This limitation underscores the importance of small-scale testbeds. These testbeds provide a realistic, cost-effective, and controlled environment for testing and validating algorithms, acting as an essential intermediary between simulation and full-scale experiments. This work serves to facilitate researchers' efforts in identifying existing small-scale testbeds suitable for their experiments and provide insights for those who want to build their own. In addition, it delivers a comprehensive survey of the current landscape of these testbeds. We derive 62 characteristics of testbeds based on the well-known sense-plan-act paradigm and offer an online table comparing 23 small-scale testbeds based on these characteristics. The online table is hosted on our designated public webpage https://bassamlab.github.io/testbeds-survey, and we invite testbed creators and developers to contribute to it. We closely examine nine testbeds in this paper, demonstrating how the derived characteristics can be used to present testbeds. Furthermore, we discuss three ongoing challenges.
|
|
WeAT19 |
407 |
Tactile Sensing 2 |
Regular Session |
Chair: Spiers, Adam | Imperial College London |
|
08:30-08:35, Paper WeAT19.1 | |
ACROSS: A Deformation-Based Cross-Modal Representation for Robotic Tactile Perception |
|
Zai El Amri, Wadhah | L3S Research Center |
Kuhlmann, Malte Fabian | L3S Research Center |
Navarro-Guerrero, Nicolás | Leibniz Universität Hannover |
Keywords: Transfer Learning, Force and Tactile Sensing, Representation Learning
Abstract: Tactile perception is essential for human interaction with the environment and is becoming increasingly crucial in robotics. Tactile sensors like the BioTac mimic human fingertips and provide detailed interaction data. Despite its utility in applications like slip detection and object identification, this sensor is now deprecated, making many valuable datasets obsolete. However, recreating similar datasets with newer sensor technologies is both tedious and time-consuming. Therefore, adapting these existing datasets for use with new setups and modalities is crucial. In response, we introduce ACROSS, a novel framework for translating data between tactile sensors by exploiting sensor deformation information. We demonstrate the approach by translating BioTac signals into the DIGIT sensor. Our framework consists of first converting the input signals into 3D deformation meshes. We then transition from the 3D deformation mesh of one sensor to the mesh of another, and finally convert the generated 3D deformation mesh into the corresponding output space. We demonstrate our approach to the most challenging problem of going from a low-dimensional tactile representation to a high-dimensional one. In particular, we transfer the tactile signals of a BioTac sensor to DIGIT tactile images. Our approach enables the continued use of valuable datasets and data exchange between groups with different setups.
|
|
08:35-08:40, Paper WeAT19.2 | |
Learning to Double Guess: An Active Perception Approach for Estimating the Center of Mass of Arbitrary Object |
|
Jin, Shengmiao | University of Illinois Urbana-Champaign |
Mo, Yuchen | University of Illinois, Urbana-Champaign |
Yuan, Wenzhen | University of Illinois |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Perception-Action Coupling
Abstract: Manipulating arbitrary objects in unstructured environments is a significant challenge in robotics, primarily due to difficulties in determining an object's center of mass. This paper introduces U-GRAPH: Uncertainty-Guided Rotational Active Perception with Haptics, a novel framework to enhance the center of mass estimation using active perception. Traditional methods often rely on singular interactions and are limited by the inherent inaccuracies of Force-Torque (F/T) sensors. Our approach circumvents these limitations by integrating a Bayesian Neural Network (BNN) to quantify uncertainty and guide the robotic system through multiple, information-rich interactions via grid search and ActiveNet. We demonstrate the remarkable generalizability and transferability of our method: trained on a small dataset with limited variation, it still performs well on unseen complex real-world objects.
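The sketch below illustrates, under stated assumptions, how an uncertainty estimate can guide the choice of the next interaction: a dropout network stands in for the Bayesian Neural Network mentioned above (Monte-Carlo dropout is a common approximation, not the paper's exact model), and the candidate rotation with the largest predictive variance is selected. The class name, function names, and input dimensions are hypothetical.

import torch
import torch.nn as nn

class CoMNet(nn.Module):
    """Maps a wrench reading plus a candidate rotation angle to a CoM estimate."""
    def __init__(self, in_dim=7, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 3),          # predicted center of mass (x, y, z)
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def choose_next_rotation(model, wrench, candidate_angles, n_samples=32):
    """Monte-Carlo dropout: keep dropout active at inference time and pick the
    candidate angle whose prediction is most uncertain (largest total variance)."""
    model.train()                          # keep dropout stochastic
    best_angle, best_score = None, -1.0
    for angle in candidate_angles:
        x = torch.cat([wrench, torch.tensor([angle])]).unsqueeze(0)
        samples = torch.stack([model(x) for _ in range(n_samples)])  # (S, 1, 3)
        score = samples.var(dim=0).sum().item()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score

if __name__ == "__main__":
    model = CoMNet()
    wrench = torch.randn(6)                # fx, fy, fz, tx, ty, tz from the F/T sensor
    angles = [0.0, 0.5, 1.0, 1.5]          # candidate rotations in radians
    print(choose_next_rotation(model, wrench, angles))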
|
|
08:40-08:45, Paper WeAT19.3 | |
Learning In-Hand Translation Using Tactile Skin with Shear and Normal Force Sensing |
|
Yin, Jessica | University of Pennsylvania |
Qi, Haozhi | UC Berkeley |
Malik, Jitendra | UC Berkeley |
Pikul, James | University of Pennsylvania |
Yim, Mark | University of Pennsylvania |
Hellebrekers, Tess | Meta AI Research |
Keywords: Force and Tactile Sensing, In-Hand Manipulation, Reinforcement Learning
Abstract: Recent progress in reinforcement learning (RL) and tactile sensing has significantly advanced dexterous manipulation. However, these methods often utilize simplified tactile signals due to the gap between tactile simulation and the real world. We introduce a sensor model for tactile skin that enables zero-shot sim-to-real transfer of ternary shear and binary normal forces. Using this model, we develop an RL policy that leverages sliding contact for dexterous in-hand translation. We conduct extensive real-world experiments to assess how tactile sensing facilitates policy adaptation to various unseen object properties and robot hand orientations. We demonstrate that our 3-axis tactile policies consistently outperform baselines that use only shear forces, only normal forces, or only proprioception. Videos and details available on the project website.
|
|
08:45-08:50, Paper WeAT19.4 | |
Contrastive Touch-To-Touch Pretraining |
|
Rodriguez, Samanta | University of Michigan - Ann Arbor |
Dou, Yiming | University of Michigan |
van den Bogert, William | University of Michigan |
Oller, Miquel | University of Michigan |
So, Kevin | University of Michigan |
Owens, Andrew | University of Michigan |
Fazeli, Nima | University of Michigan |
Keywords: Representation Learning, Force and Tactile Sensing, Deep Learning in Grasping and Manipulation
Abstract: Tactile sensors differ greatly in design, making it challenging to develop general-purpose methods for processing tactile feedback. In this paper, we introduce a contrastive self-supervised learning approach that represents tactile feedback across different sensor types. Our method utilizes paired tactile data—where two distinct sensors, in our case Soft Bubbles and GelSlims, grasp the same object in the same configuration—to learn a unified latent representation. Unlike current approaches that focus on reconstruction or task-specific supervision, our method employs contrastive learning to create a latent space that captures shared information between sensors. By treating paired tactile signals as positives and unpaired signals as negatives, we show that our model effectively learns a rich, sensor-agnostic representation. Despite significant differences between Soft Bubble and GelSlim sensors, the learned representation enables strong downstream task performance, including zero-shot and few-shot classification and pose estimation. This work provides a scalable solution for integrating tactile data across diverse sensor modalities, advancing the development of generalizable tactile representations.
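A minimal sketch of the paired contrastive idea, assuming two sensor-specific encoders mapping into a shared space and a symmetric InfoNCE objective in which matched pairs within a batch are positives; this is a generic illustration, not the authors' training code, and the encoder sizes and names are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(in_dim, out_dim=128):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: matched rows of z_a and z_b are positives,
    all other rows in the batch are negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    bubble_enc, gelslim_enc = make_encoder(512), make_encoder(1024)
    opt = torch.optim.Adam(list(bubble_enc.parameters()) +
                           list(gelslim_enc.parameters()), lr=1e-4)
    bubble = torch.randn(32, 512)                  # flattened readings, sensor A
    gelslim = torch.randn(32, 1024)                # paired readings, sensor B
    loss = info_nce(bubble_enc(bubble), gelslim_enc(gelslim))
    loss.backward()
    opt.step()
    print(float(loss))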
|
|
08:50-08:55, Paper WeAT19.5 | |
ViTract: Robust Object Shape Perception Via Active Visuo-Tactile Interaction |
|
Dutta, Anirvan | BMW Group and Imperial College London |
Burdet, Etienne | Imperial College London |
Kaboli, Mohsen | Eindhoven University of Technology ( TU/e) & BMW Group Research |
Keywords: Perception for Grasping and Manipulation, Force and Tactile Sensing
Abstract: An essential problem in robotic systems that are to be deployed in unstructured environments is the accurate and autonomous perception of the shapes of previously unseen objects. Existing methods for shape estimation or reconstruction have leveraged either visual or tactile interactive exploration techniques, or have relied on comprehensive visual or tactile information acquired in an offline manner. In this work, a novel visuo-tactile interactive perception framework - ViTract is introduced for shape estimation of unseen objects. Our framework estimates the shape of diverse objects robustly using low-dimensional, efficient, and generalizable shape primitives, which are superquadrics. The probabilistic formulation within our framework takes advantage of the complementary information provided by vision and tactile observations while accounting for associated noise. As part of our framework, we propose a novel modality-specific information gain to select the most informative and reliable exploratory action (using vision/tactile) to obtain iterative visuo/tactile information. Our real-robot experiments demonstrate superior and robust performance compared to state-of-the-art visuo-tactile-based shape estimation techniques.
|
|
08:55-09:00, Paper WeAT19.6 | |
Location and Orientation Super-Resolution Sensing with a Cost-Efficient and Repairable Barometric Tactile Sensor |
|
Hou, Jian | Imperial College London |
Zhou, Xin | Imperial College London |
Spiers, Adam | Imperial College London |
Keywords: Force and Tactile Sensing, Perception for Grasping and Manipulation, Grippers and Other End-Effectors, Barometric Sensing
Abstract: The adoption of tactile sensors in robotics is hindered by their high cost and fragility. We designed and validated a cost-effective and robust barometric tactile sensor array, whose material cost is below 80 USD. Unlike past work, we do not mold the rubber surface over the barometers but instead keep it as a separate element, leading to a design that is easy to fabricate and repair. Machine learning techniques are applied to enhance the sensor’s localization precision, increasing the effective resolution from 6 mm (the distance between adjacent barometers) to 0.284 mm. To investigate the localization model’s robustness, we utilized an E-TRoll robotic gripper to roll differently shaped prismatic objects across the sensing surface mounted on one finger. Under these uncontrolled settings, we achieved a satisfactory real-time localization resolution of within 2.68 mm. Furthermore, we demonstrate a novel practical application: the E-TRoll mimics a 1-DoF parallel gripper while inferring a cube’s orientation relative to the sensor. The range of orientations is split into 4 classes, which a trained CNN-LSTM model can predict with an 86.91% five-fold cross-validated accuracy.
|
|
WeAT20 |
408 |
Human Motion Sensing |
Regular Session |
Chair: Youcef-Toumi, Kamal | Massachusetts Institute of Technology |
Co-Chair: Cao, Muqing | Carnegie Mellon University |
|
08:30-08:35, Paper WeAT20.1 | |
Person Re-Identification for Robot Person Following with Online Continual Learning |
|
Ye, Hanjing | Southern University of Science and Technology |
Zhao, Jieting | Southern University of Science and Technology |
Zhan, Yu | Southern University of Science and Technology |
Chen, Weinan | Guangdong University of Technology |
He, Li | Southern University of Science and Technology |
Zhang, Hong | Southern University of Science and Technology |
Keywords: Human-Centered Automation, Computer Vision for Automation, Continual Learning
Abstract: Robot person following (RPF) is a crucial capability in human-robot interaction (HRI) applications, allowing a robot to persistently follow a designated person. In practical RPF scenarios, the person can often be occluded by other objects or people. Consequently, it is necessary to re-identify the person when he/she reappears within the robot's field of view. Previous person re-identification (ReID) approaches to person following rely on a fixed feature extractor. Such an approach often fails to generalize to different viewpoints and lighting conditions in practical RPF environments. In other words, it suffers from the so-called domain shift problem where it cannot re-identify the person when his re-appearance is out of the domain modeled by the fixed feature extractor. To mitigate this problem, we propose a ReID framework for RPF where we use a feature extractor that is optimized online with both short-term and long-term experiences (i.e., recently and previously observed samples during RPF) using the online continual learning (OCL) framework. The long-term experiences are maintained by a memory manager to enable OCL to update the feature extractor. Our experiments demonstrate that even in the presence of severe appearance changes and distractions from visually similar people, the proposed method can still re-identify the person more accurately than the state-of-the-art methods.
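A common way to maintain the kind of bounded long-term experience buffer described above is reservoir sampling; the sketch below is a generic illustration of such a memory combined with short-term replay for an online update, not the paper's memory manager. Class and method names are illustrative assumptions.

import random

class ReservoirMemory:
    def __init__(self, capacity=512, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        """Standard reservoir sampling: every sample ever seen has equal
        probability of residing in the bounded buffer."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            j = self.rng.randint(0, self.seen - 1)
            if j < self.capacity:
                self.buffer[j] = sample

    def replay(self, k):
        k = min(k, len(self.buffer))
        return self.rng.sample(self.buffer, k)

if __name__ == "__main__":
    memory = ReservoirMemory(capacity=4)
    for t in range(20):
        sample = {"frame": t, "label": "target" if t % 3 == 0 else "distractor"}
        memory.add(sample)                       # long-term experience
        recent_batch = [sample]                  # short-term experience
        train_batch = recent_batch + memory.replay(2)
        # feature_extractor.update(train_batch)  # placeholder for the online update step
    print([s["frame"] for s in memory.buffer])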
|
|
08:35-08:40, Paper WeAT20.2 | |
HelmetPoser: A Helmet-Mounted IMU Dataset for Data-Driven Estimation of Human Head Motion in Diverse Conditions |
|
Li, Jianping | Nanyang Technological University |
Leng, Qiutong | Nanyang Technological University |
Liu, Jinxin | Nanyang Technological University |
Xu, Xinhang | Nanyang Technological University |
Jin, Tongxing | Nanyang Technological University |
Cao, Muqing | Carnegie Mellon University |
Nguyen, Thien-Minh | Nanyang Technological University |
Yuan, Shenghai | Nanyang Technological University |
Cao, Kun | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Datasets for Human Motion, Wearable Robotics, SLAM
Abstract: Helmet-mounted wearable positioning systems are crucial for enhancing safety and facilitating coordination in industrial, construction, and emergency rescue environments. These systems, including LiDAR-Inertial Odometry (LIO) and Visual-Inertial Odometry (VIO), often face challenges in localization due to adverse environmental conditions such as dust, smoke, and limited visual features. To address these limitations, we propose a novel head-mounted Inertial Measurement Unit (IMU) dataset with ground truth, aimed at advancing data-driven IMU pose estimation. Our dataset captures human head motion patterns using a helmet-mounted system, with data from ten participants performing various activities. We explore the application of neural networks, specifically Long Short-Term Memory (LSTM) and Transformer networks, to correct IMU biases and improve localization accuracy. Additionally, we evaluate the performance of these methods across different IMU data window dimensions, motion patterns, and sensor types. We release a publicly available dataset, demonstrate the feasibility of advanced neural network approaches for helmet-based localization, and provide evaluation metrics to establish a baseline for future studies in this field. Data and code can be found at https://lqiutong.github.io/HelmetPoser.github.io/
|
|
08:40-08:45, Paper WeAT20.3 | |
Relevance-Driven Decision Making for Safer and More Efficient Human Robot Collaboration |
|
Zhang, Xiaotong | Massachusetts Institute of Technology |
Huang, Dingcheng | Massachusetts Institute of Technology |
Youcef-Toumi, Kamal | Massachusetts Institute of Technology |
Keywords: Human-Robot Collaboration, Cognitive Modeling, Collision Avoidance
Abstract: The human brain possesses the ability to effectively focus on important environmental components, which enhances perception, learning, reasoning, and decision-making. Inspired by this cognitive mechanism, we introduce a novel concept termed relevance for Human-Robot Collaboration (HRC). Relevance is a dimensionality reduction process that incorporates a continuously operating perception module, evaluates cue sufficiency within the scene, and applies a flexible formulation and computation framework. In this paper, we present an enhanced two-loop framework that integrates real-time and asynchronous processing to quantify relevance and leverage it for safer and more efficient HRC. The two-loop framework integrates an asynchronous loop, which leverages an LLM’s world knowledge to quantify relevance, and a real-time loop, which performs scene understanding, human intent prediction, and decision-making based on relevance. HRC decision-making is enhanced by a relevance-based task allocation method, as well as a motion generation and collision avoidance approach that incorporates human trajectory prediction. Simulations and experiments show that our methodology for relevance quantification can accurately and robustly predict the human objective and relevance, with an average accuracy of up to 0.90 for objective prediction and up to 0.96 for relevance prediction. Moreover, our motion generation methodology reduces collision cases by 63.76% and collision frames by 44.74% when compared with a state-of-the-art (SOTA) collision avoidance method. Built on relevance, our framework and methodologies guide the robot on how best to assist humans and generate safer and more efficient actions for HRC.
|
|
08:45-08:50, Paper WeAT20.4 | |
Back to the Cartesian: Pilot Study for Assessing Human Stiffness in 3D Cartesian Space by Transforming from Muscle Space in a Peg-In-Hole Scenario for Tele-Impedance |
|
Thuerauf, Sabine | Friedrich-Alexander-University Erlangen-Nuremberg |
Mehrkens, Florian | FAU Erlangen-Nuernberg |
Castellini, Claudio | Friedrich-Alexander-Universität Erlangen-Nürnberg |
Sierotowicz, Marek | Friedrich-Alexander Universität Erlangen Nürnberg |
Keywords: Telerobotics and Teleoperation, Compliance and Impedance Control, Intention Recognition
Abstract: For various teleoperation tasks, position-based control is not practical. Impedance-based control is superior, e.g., for handling fragile objects, as when harvesting fruit or grasping a paper cup. However, only a few researchers have focused on impedance control for teleoperation. In tele-impedance, the stiffness of a human is measured and transferred to a controller of a robot. Until now, human stiffness has mostly been measured either for specific joints or in 2D Cartesian space. We introduce a new way of measuring Cartesian stiffness in 3D using electromyography. Users were asked to perform a peg-in-hole task in three different orientations (0°, 45°, 90°). Meanwhile, electromyography measurements were recorded at shoulder and elbow muscle groups. In a proof-of-concept study, we showed that the measured stiffness matrix in Cartesian space differed significantly across the three differently oriented peg-in-hole scenarios. This demonstrates that human stiffness could be predicted in 3D Cartesian space based on the type of task at hand.
|
|
08:50-08:55, Paper WeAT20.5 | |
Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images |
|
Käs, Stephanie | RWTH Aachen University |
Peter, Sven | RWTH Aachen University |
Thillmann, Henrik | Chair for Computer Vision, RWTH Aachen University |
Burenko, Anton | RWTH Aachen |
Adrian, David Benjamin | Bosch Corporate Research & Ulm University |
Mack, Dennis | Robert Bosch GmbH |
Linder, Timm | Robert Bosch GmbH |
Leibe, Bastian | RWTH Aachen University |
Keywords: Gesture, Posture and Facial Expressions, Human Detection and Tracking, Omnidirectional Vision
Abstract: Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV have not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate our approach on the novel FISHnCHIPS dataset, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses.
|
|
08:55-09:00, Paper WeAT20.6 | |
HuMAn – the Human Motion Anticipation Algorithm Based on Recurrent Neural Networks |
|
Noppeney, Victor | University of São Paulo |
Escalante, Felix M | São Paulo State University |
Maggi, Lucas | University of Sao Paulo |
Boaventura, Thiago | University of Sao Paulo |
Keywords: Modeling and Simulating Humans, Human and Humanoid Motion Analysis and Synthesis, Intention Recognition
Abstract: Predicting human motion may lead to considerable advantages for human-robot interaction, particularly when precise synchronization between the robot’s motion and the user’s movement is imperative. The inherent stochastic nature of human behavior, combined with the restricted window of response, can give rise to residual and undesirable forces during interactions, potentially harming the user. Therefore, efficient prediction of human joint movements may enhance the performance of various interaction control frameworks used in wearable robots. This paper proposes the HuMAn algorithm for predicting human joint motion based on a recurrent neural network. This algorithm consists of a long-term memory network, used to interpret sequences of poses, and a prediction layer, employed to build the most likely future user poses within a specified time horizon. Network training was performed using datasets encompassing various subjects and types of motion. The results demonstrate the effectiveness of the proposed algorithm, as evidenced by average general prediction errors below 0.1 radians for predictive horizons of up to 500 milliseconds. Furthermore, a mean absolute error of 0.026 radians was achieved for a periodic treadmill walk. Simulation results demonstrate a large improvement in transparency control performance in a case study with an upper limb exoskeleton robot.
|
|
WeAT21 |
410 |
Robot Foundation Models 1 |
Regular Session |
Chair: Li, Hui | Autodesk Research |
Co-Chair: Nguyen, Anh | University of Liverpool |
|
08:30-08:35, Paper WeAT21.1 | |
Robotic-CLIP: Fine-Tuning CLIP on Action Data for Robotic Applications |
|
Nguyen, Nghia | FPT Software Company Limited |
Vu, Minh Nhat | TU Wien, Austria |
Ta, Tung D. | The University of Tokyo |
Huang, Baoru | Imperial College London |
Vo, Thieu | National University of Singapore |
Le, Ngan | University of Arkansas |
Nguyen, Anh | University of Liverpool |
Keywords: Perception-Action Coupling, Representation Learning
Abstract: Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Intensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.
|
|
08:35-08:40, Paper WeAT21.2 | |
In-Context Imitation Learning Via Next-Token Prediction |
|
Fu, Letian | UC Berkeley |
Huang, Huang | University of California at Berkeley |
Datta, Gaurav | UC Berkeley |
Chen, Lawrence Yunliang | UC Berkeley |
Panitch, William | University of California, Berkeley |
Liu, Fangchen | University of California, Berkeley |
Li, Hui | Autodesk Research |
Goldberg, Ken | UC Berkeley |
Keywords: Learning from Demonstration, Imitation Learning, Data Sets for Robot Learning
Abstract: In-context imitation learning is the capability to perform novel tasks when prompted with task demonstration examples. In-Context Robot Transformer (ICRT) is a causal transformer that performs autoregressive prediction on sensorimotor trajectories, which include images, proprioceptive states, and actions. This approach supports flexible and training-free execution of new tasks at test time. Experiments with a Franka Emika robot demonstrate that ICRT can adapt to new environment configurations that differ from both the prompt and the training data. In a multi-task environment setup, ICRT significantly outperforms current state-of-the-art robot foundation models on generalization to unseen tasks. Code, data, and appendix are available on https://icrt.dev.
|
|
08:40-08:45, Paper WeAT21.3 | |
Data Augmentation for NeRFs in the Low Data Limit |
|
Gaggar, Ayush | Northwestern University |
Murphey, Todd | Northwestern University |
Keywords: Incremental Learning, Deep Learning for Visual Perception, Planning under Uncertainty
Abstract: Current methods based on Neural Radiance Fields fail in the low data limit, particularly when training on incomplete scene data. Prior works augment training data only in next-best-view applications, leading to hallucinations and model collapse with sparse data. In contrast, we propose adding a set of views during training by rejection sampling from a posterior uncertainty distribution, generated by combining a volumetric uncertainty estimator with spatial coverage. We validate our results on partially observed scenes; on average, our method performs 39.9% better with 87.5% less variability across established scene reconstruction benchmarks, compared to state-of-the-art baselines. We further demonstrate that augmenting the training set by sampling from any distribution leads to better, more consistent scene reconstruction in sparse environments. This work is foundational for robotic tasks where augmenting a dataset with informative data is critical in resource-constrained, a priori unknown environments. Videos and source code are available at https://murpheylab.github.io/low-data-nerf.
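As a rough sketch of the view-selection idea described above (not the authors' implementation), candidate views below are drawn by rejection sampling from an unnormalised score that combines a per-view uncertainty term with a spatial-coverage term. The scoring functions and the parameterisation of a view are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def uncertainty(view):
    """Stand-in for a volumetric uncertainty estimate along a view direction."""
    return 0.5 + 0.5 * np.sin(3.0 * view[0]) ** 2

def coverage_gain(view, chosen):
    """Stand-in for spatial coverage: reward views far from those already chosen."""
    if not chosen:
        return 1.0
    d = min(np.linalg.norm(view - c) for c in chosen)
    return min(1.0, d / np.pi)

def sample_views(n_views, max_tries=10000):
    """Rejection sampling: accept a uniformly drawn view with probability
    proportional to uncertainty(view) * coverage_gain(view)."""
    chosen = []
    tries = 0
    while len(chosen) < n_views and tries < max_tries:
        tries += 1
        view = rng.uniform([0.0, -0.5], [2.0 * np.pi, 0.5])   # (azimuth, elevation)
        score = uncertainty(view) * coverage_gain(view, chosen)
        if rng.uniform() < score:            # scores are already within [0, 1] here
            chosen.append(view)
    return np.array(chosen)

if __name__ == "__main__":
    print(sample_views(5).round(2))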
|
|
08:45-08:50, Paper WeAT21.4 | |
Generalizable Imitation Learning through Pre-Trained Representations |
|
Chang, Wei-Di | McGill University |
Hogan, Francois | Massachusetts Institute of Technology |
Fujimoto, Scott | McGill University |
Meger, David Paul | McGill University |
Dudek, Gregory | McGill University |
Keywords: Imitation Learning, Learning from Demonstration, Representation Learning
Abstract: In this paper, we leverage self-supervised vision transformer models and their emergent semantic abilities to improve the generalization abilities of imitation learning policies. We introduce DVK, an imitation learning algorithm that leverages rich pre-trained Visual Transformer patch-level embeddings to obtain better generalization when learning through demonstrations. Our learner sees the world by clustering appearance features into groups associated with semantic concepts, forming stable keypoints that generalize across a wide range of appearance variations and object types. We demonstrate how this representation enables generalized behaviour by evaluating imitation learning across a diverse dataset of object manipulation tasks. To facilitate further study of generalization in Imitation Learning, all of our code for the method and evaluation, as well as the dataset, is made available.
|
|
08:50-08:55, Paper WeAT21.5 | |
Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors Via Language Grounding |
|
Jones, Joshua | University of California, Berkeley |
Mees, Oier | University of California, Berkeley |
Sferrazza, Carmelo | UC Berkeley |
Stachowicz, Kyle | University of California, Berkeley |
Abbeel, Pieter | UC Berkeley |
Levine, Sergey | UC Berkeley |
Keywords: Big Data in Robotics and Automation, Sensorimotor Learning, Learning from Demonstration
Abstract: Interacting with the world is a multi-sensory experience: achieving effective general-purpose interaction requires making use of all available modalities -- including vision, touch, and audio -- to fill in gaps from partial observation. For example, when vision is occluded while reaching into a bag, a robot should rely on its senses of touch and sound. However, state-of-the-art generalist robot policies are typically trained on large datasets to predict robot actions solely from visual and proprioceptive observations. In this work, we propose FuSe, a novel approach that enables finetuning visuomotor generalist policies on heterogeneous sensor modalities for which large datasets are not readily available by leveraging natural language as a common cross-modal grounding. We combine a multimodal contrastive loss with a sensory-grounded language generation loss to encode high-level semantics. In the context of robot manipulation, we show that FuSe enables performing challenging tasks that require reasoning jointly over modalities such as vision, touch, and sound in a zero-shot setting, such as multimodal prompting, compositional cross-modal prompting, and descriptions of objects it interacts with. We show that the same recipe is applicable to widely different generalist policies, including both diffusion-based generalist policies and large vision-language-action (VLA) models. Extensive experiments in the real world show that FuSe is able to increase success rates by over 20% compared to all considered baselines.
|
|
08:55-09:00, Paper WeAT21.6 | |
Simultaneous Geometry and Pose Estimation of Held Objects Via 3D Foundation Models |
|
Zhi, Weiming | Carnegie Mellon University |
Tang, Haozhan | Carnegie Mellon University |
Zhang, Tianyi | Carnegie Mellon University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Deep Learning for Visual Perception, Perception for Grasping and Manipulation, Deep Learning Methods
Abstract: Humans have the remarkable ability to use held objects as tools to interact with their environment. For this to occur, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute methodology to jointly estimate the geometry and pose of objects grasped by a robot, from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame, while not requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models, large models pre-trained on huge datasets for 3D vision tasks, to produce initial estimates of the in-hand object. These initial estimations do not have physically correct scales and are in the camera's frame. Then, we formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scales, along with a transformation of the objects to the coordinate frame of the robot. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings enable the estimation of points on the held object at arbitrary configurations, enabling robot motion to be designed with respect to coordinates on the grasped objects. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
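The coordinate-alignment step described above recovers a scale and a rigid transform from point correspondences; a standard closed-form tool for this is the Umeyama method, sketched below as a generic illustration rather than the authors' specific formulation. The synthetic correspondences in the usage example are assumptions.

import numpy as np

def umeyama_alignment(src, dst):
    """Return (s, R, t) such that dst ~= s * R @ src_i + t, for (N, 3) arrays."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    var_src = (src_c ** 2).sum() / src.shape[0]
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    src = rng.normal(size=(50, 3))                 # points in the camera/model frame
    angle = 0.4
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                       [np.sin(angle),  np.cos(angle), 0],
                       [0, 0, 1]])
    dst = 2.5 * src @ R_true.T + np.array([0.1, -0.3, 0.7])   # robot-frame points
    s, R, t = umeyama_alignment(src, dst)
    print(round(s, 3), np.allclose(R, R_true, atol=1e-6))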
|
|
WeAT22 |
411 |
Learning for Robot Control |
Regular Session |
|
08:30-08:35, Paper WeAT22.1 | |
Gradient Descent-Based Task-Orientation Robot Control Enhanced with Gaussian Process Predictions |
|
Roveda, Loris | SUPSI-IDSIA |
Pavone, Marco | Stanford University |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Compliance and Impedance Control
Abstract: This paper proposes a novel force-based task-orientation controller for interaction tasks with environmental orientation uncertainties. The main aim of the controller is to align the robot tool along the main task direction (e.g., along a screwing, insertion, or polishing direction) without the use of any external sensors (e.g., vision systems), relying only on end-effector wrench measurements or estimates. We propose a gradient descent-based orientation controller, enhancing its performance with the orientation predictions provided by a Gaussian Process model. The derivation of the controller is presented, together with simulation results (considering a probing task) and experimental results involving various re-orientation scenarios, i.e., i) a task with the robot in interaction with a soft environment, ii) a task with the robot in interaction with a stiff and inclined environment, and iii) a task to enable the assembly of a gear into its shaft. The proposed controller is compared against a state-of-the-art approach, highlighting its ability to re-orient the robot tool even in complex tasks (where the state-of-the-art method fails).
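A hypothetical 1-DoF sketch of the general idea, assuming a wrench-derived misalignment cost, a finite-difference gradient step, and a Gaussian Process fitted on the interaction history whose predicted minimiser biases the step. The cost model, gains, and blending rule are illustrative, not the authors' controller.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

TRUE_TASK_ANGLE = 0.6          # unknown task direction the tool should align to

def lateral_force_cost(theta):
    """Proxy cost: lateral force grows with misalignment (plus sensor noise)."""
    return (theta - TRUE_TASK_ANGLE) ** 2 + 0.01 * np.random.randn()

def align(theta0=0.0, steps=30, lr=0.4, eps=0.05, blend=0.3):
    theta = theta0
    history_x, history_y = [], []
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-2)
    for _ in range(steps):
        # finite-difference gradient of the measured cost
        grad = (lateral_force_cost(theta + eps) - lateral_force_cost(theta - eps)) / (2 * eps)
        step = -lr * grad
        history_x.append([theta])
        history_y.append(lateral_force_cost(theta))
        if len(history_x) >= 5:
            gp.fit(np.array(history_x), np.array(history_y))
            # GP suggestion: the lowest predicted cost on a coarse candidate grid
            grid = np.linspace(theta - 0.5, theta + 0.5, 41).reshape(-1, 1)
            theta_gp = grid[np.argmin(gp.predict(grid)), 0]
            step = (1 - blend) * step + blend * (theta_gp - theta)
        theta += step
    return theta

if __name__ == "__main__":
    np.random.seed(0)
    print(round(align(), 3), "target:", TRUE_TASK_ANGLE)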
|
|
08:35-08:40, Paper WeAT22.2 | |
Model-Free Inverse H-Infinity Control for Imitation Learning (I) |
|
Xue, Wenqian | University of Florida |
Lian, Bosen | Auburn University |
Kartal, Yusuf | Turkish Aerospace |
Fan, Jialu | Northeastern University |
Chai, Tianyou | Northeastern University, Shenyang, China |
Lewis, Frank | The University of Texas at Arlington |
Keywords: Reinforcement Learning, Imitation Learning, Machine Learning for Robot Control
Abstract: This paper proposes a data-driven model-free inverse reinforcement learning (IRL) algorithm tailored for solving an inverse H-infinity control problem. In this problem, both an expert and a learner engage in H-infinity control to reject disturbances, and the learner's objective is to imitate the expert's behavior by reconstructing the expert's performance function through IRL techniques. Introducing zero-sum game principles, we first formulate a model-based single-loop IRL policy iteration algorithm that includes three key steps: updating the policy, action, and performance function using a new correction formula and the standard inverse optimal control principles. Building upon the model-based approach, we propose a model-free single-loop off-policy IRL algorithm that eliminates the need for initial stabilizing policies and prior knowledge of the dynamics of expert and learner. Also, we provide rigorous proof of convergence, stability, and Nash optimality to guarantee the effectiveness and reliability of the proposed algorithms. Furthermore, we showcase the efficiency of our algorithm through simulations and experiments, highlighting its advantages compared to the existing methods.
|
|
08:40-08:45, Paper WeAT22.3 | |
Learning Object Properties Using Robot Proprioception Via Differentiable Robot-Object Interaction |
|
Chen, Peter Yichen | MIT |
Liu, Chao | Massachusetts Institute of Technology |
Ma, Pingchuan | MIT CSAIL |
Eastman, John | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Randle, Dylan Labatt | Amazon Robotics |
Ivanov, Yuri | Amazon |
Matusik, Wojciech | MIT |
Keywords: Machine Learning for Robot Control, Sensorimotor Learning, Learning from Demonstration
Abstract: Differentiable simulation has become a powerful tool for system identification. While prior work has focused on identifying robot properties using robot-specific data or object properties using object-specific data, our approach calibrates object properties by using information from the robot, without relying on data from the object itself. Specifically, we utilize robot joint encoder information, which is commonly available in standard robotic systems. Our key observation is that by analyzing the robot's reactions to manipulated objects, we can infer properties of those objects, such as inertia and softness. Leveraging this insight, we develop differentiable simulations of robot-object interactions to inversely identify the properties of the manipulated objects. Our approach relies solely on proprioception — the robot’s internal sensing capabilities — and does not require external measurement tools or vision-based tracking systems. This general method is applicable to any articulated robot and requires only joint position information. We demonstrate the effectiveness of our method on a low-cost robotic platform, achieving accurate mass and elastic modulus estimations of manipulated objects with just a few seconds of computation on a laptop.
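The following toy sketch shows the underlying mechanism of identifying a physical parameter by backpropagating through a differentiable rollout, here a torque-driven damped pendulum whose mass is recovered from an observed joint trajectory. The dynamics, constants, and optimizer settings are illustrative assumptions; the paper applies the idea to full robot-object interactions using joint-encoder data.

import torch

G, L, B, DT, STEPS = 9.81, 0.5, 0.05, 0.01, 300
TORQUE = 0.4                                     # constant applied joint torque

def rollout(mass):
    """Semi-implicit Euler rollout of theta'' = (tau - b*w)/(m*L^2) - (g/L)*sin(theta)."""
    theta = torch.tensor(0.0)
    omega = torch.tensor(0.0)
    trajectory = []
    for _ in range(STEPS):
        alpha = (TORQUE - B * omega) / (mass * L ** 2) - (G / L) * torch.sin(theta)
        omega = omega + DT * alpha
        theta = theta + DT * omega
        trajectory.append(theta)
    return torch.stack(trajectory)

if __name__ == "__main__":
    true_mass = torch.tensor(1.7)
    observed = rollout(true_mass).detach()       # stands in for joint-encoder data

    mass = torch.tensor(1.0, requires_grad=True) # initial guess
    opt = torch.optim.Adam([mass], lr=0.05)
    for it in range(200):
        opt.zero_grad()
        loss = torch.mean((rollout(mass) - observed) ** 2)
        loss.backward()
        opt.step()
    print(f"estimated mass: {mass.item():.3f}  (true 1.700)")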
|
|
08:45-08:50, Paper WeAT22.4 | |
Reservoir Computing Encodes Physical Adaptations for Reinforcement Learning |
|
Giannetto, Cross | CCIR |
Ibragim, Atadjanov | Kyung Hee University |
Iida, Fumiya | University of Cambridge |
Abdulali, Arsen | Cambridge University |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Reinforcement Learning
Abstract: Adapting reinforcement learning (RL) policies to various robot body configurations is a significant challenge for creating flexible autonomous systems. This study presents a novel framework that integrates Reservoir Computing (RC) with the First-Order Reduced and Controlled Error (FORCE) learning rule to enhance policy adaptability in RL. The RC serves as a dynamic feature extractor, capturing temporal dependencies by pre-training on state transitions generated through random actions. This pre-training acts as regularization, reducing variance and preventing overfitting to specific configurations. Subsequently, the control policy network is trained on a limited set of body variations using the enriched features from the RC. Experimental results across three distinct environments demonstrate that the proposed RC+FORCE framework significantly improves policy performance and adaptability to unseen robot configurations compared to traditional reinforcement learning through domain randomization. These findings highlight the effectiveness of combining RC-based feature extraction with FORCE-based training in developing robust RL agents.
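A minimal sketch of reservoir computing with FORCE learning, assuming a fixed random recurrent reservoir and a linear readout adapted online with the recursive-least-squares rule on a toy signal-tracking task. Sizes, spectral radius, and the target signal are illustrative; the paper couples reservoir features with an RL policy network rather than a simple readout.

import numpy as np

rng = np.random.default_rng(0)

N, IN_DIM = 200, 1
W_in = rng.uniform(-0.5, 0.5, size=(N, IN_DIM))
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius to 0.9

w_out = np.zeros(N)                               # trainable readout weights
P = np.eye(N)                                     # inverse correlation estimate
x = np.zeros(N)                                   # reservoir state

T = 2000
t = np.arange(T) * 0.01
u = np.sin(2 * np.pi * t)                         # input drive
target = np.sin(2 * np.pi * t + 0.5)              # phase-shifted target signal

errors = []
for k in range(T):
    x = np.tanh(W @ x + W_in @ np.array([u[k]]))  # reservoir update
    y = w_out @ x
    e = y - target[k]
    # FORCE / recursive-least-squares update of the readout only
    Px = P @ x
    gain = Px / (1.0 + x @ Px)
    P -= np.outer(gain, Px)
    w_out -= e * gain
    errors.append(abs(e))

print("mean |error| first 100 steps:", round(float(np.mean(errors[:100])), 3))
print("mean |error| last 100 steps:", round(float(np.mean(errors[-100:])), 3))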
|
|
08:50-08:55, Paper WeAT22.5 | |
Self-Supervised Meta-Learning for All-Layer DNN-Based Adaptive Control with Stability Guarantees |
|
He, Guanqi | Carnegie Mellon University |
Choudhary, Yogita | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Machine Learning for Robot Control, Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: A critical goal of adaptive control is enabling robots to rapidly adapt in dynamic environments. Recent studies have developed a meta-learning-based adaptive control scheme, which uses meta-learning to extract nonlinear features (represented by Deep Neural Networks (DNNs)) from offline data, and uses adaptive control to update linear coefficients online. However, such a scheme is fundamentally limited by the linear parameterization of uncertainties and does not fully unleash the capability of DNNs. This paper introduces a novel learning-based adaptive control framework that pretrains a DNN via self-supervised meta-learning (SSML) from offline trajectories and adapts the full DNN online via composite adaptation. In particular, the offline SSML stage leverages the time consistency in trajectory data to train the DNN to predict future disturbances from history, in a self-supervised manner without environment condition labels. The online stage carefully designs a control law and an adaptation law to update the full DNN with stability guarantees. Empirically, the proposed framework significantly outperforms (19-39%) various classic and learning-based adaptive control baselines in challenging real-world quadrotor tracking problems under large dynamic wind disturbance.
|
|
08:55-09:00, Paper WeAT22.6 | |
Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation |
|
Luo, Jing Yuan | ETH Zurich |
Song, Yunlong | University of Zurich |
Klemm, Victor | ETH Zurich |
Shi, Fan | National University of Singapore |
Scaramuzza, Davide | University of Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Machine Learning for Robot Control, Vision-Based Navigation, Legged Robots
Abstract: First-order Policy Gradient (FoPG) algorithms such as Backpropagation through Time and Analytical Policy Gradients leverage local simulation physics to accelerate policy search, significantly improving sample efficiency in robot control compared to standard model-free reinforcement learning. However, FoPG algorithms can exhibit poor learning dynamics in contact-rich tasks like locomotion. Previous approaches address this issue by alleviating contact dynamics via algorithmic or simulation innovations. In contrast, we propose guiding the policy search by learning a residual over a simple baseline policy. For quadruped locomotion, we find that the role of residual policy learning in FoPG-based training (FoPG RPL) is primarily to improve asymptotic rewards, compared to improving sample efficiency for model-free RL. Additionally, we provide insights on applying FoPGs to pixel-based local navigation, training a point-mass robot to convergence within seconds. Finally, we showcase the versatility of FoPG RPL by using it to train locomotion and perceptive navigation end-to-end on a quadruped in minutes.
|
|
WeAT23 |
412 |
Autonomous Vehicle Perception 3 |
Regular Session |
Chair: Wang, Shenlong | University of Illinois at Urbana-Champaign |
Co-Chair: Chen, Yong-Sheng | National Yang Ming Chiao Tung University |
|
08:30-08:35, Paper WeAT23.1 | |
METDrive: Multimodal End-To-End Autonomous Driving with Temporal Guidance |
|
Guo, Ziang | Skolkovo Institute of Science and Technology |
Lin, Xinhao | Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences) |
Yagudin, Zakhar | Skolkovo Institute of Science and Technology |
Lykov, Artem | Skolkovo Institute of Science and Technology |
Wang, Yong | Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences) |
Li, Yanqiang | Institute of Automation, Qilu University of Technology (Shandong Academy of Sciences) |
Tsetserukou, Dzmitry | Skolkovo Institute of Science and Technology |
Keywords: Imitation Learning, Integrated Planning and Learning, Sensor Fusion
Abstract: Multimodal end-to-end autonomous driving has shown promising advancements in recent work. By embedding more modalities into end-to-end networks, the system’s understanding of both static and dynamic aspects of the driving environment is enhanced, thereby improving the safety of autonomous driving. In this paper, we introduce METDrive, an end-to-end system that leverages temporal guidance from the embedded time series features of ego states, including rotation angles, steering, throttle signals, and waypoint vectors. The geometric features derived from perception sensor data and the time series features of ego state data jointly guide the waypoint prediction with the proposed temporal guidance loss function. We evaluated METDrive on the CARLA leaderboard benchmarks, achieving a driving score of 70%, a route completion score of 94%, and an infraction score of 0.78.
|
|
08:35-08:40, Paper WeAT23.2 | |
Generalizing Motion Planners with Mixture of Experts for Autonomous Driving |
|
Sun, Qiao | Shanghai QiZhi Institute |
Wang, Huimin | Li Auto |
Zhan, Jiahao | Fudan University |
Nie, Fan | Stanford University |
Wen, Xin | Li Auto |
Xu, Leimeng | Li Auto |
Zhan, Kun | LiAuto |
Jia, Peng | Li Auto |
Lang, Xianpeng | LiAuto |
Zhao, Hang | Tsinghua University |
Keywords: Learning from Demonstration, Representation Learning, Imitation Learning
Abstract: Large real-world driving datasets have sparked significant research into various aspects of learning-based motion planners for autonomous driving. These include data augmentation, model architecture, reward design, training strategies, and planner pipelines. In this paper, we review and benchmark previous methods. Experiments show that many of these approaches have limited generalization abilities in planning performance due to overly complex designs or training paradigms. Experiments further reveal that as models are appropriately scaled, many designs become redundant. Therefore, we introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner. STR2 uses a Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal transformer architecture. The MoE backbone addresses modality collapse and reward balancing by expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. We evaluate its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow.
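The sketch below shows a generic top-k mixture-of-experts feed-forward block, the kind of routing layer an MoE transformer backbone builds on; it is not the STR2 architecture, and the expert count, hidden sizes, and the omission of a load-balancing loss are illustrative simplifications.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (batch, tokens, d_model)
        b, t, d = x.shape
        flat = x.reshape(-1, d)                    # route each token independently
        logits = self.gate(flat)                   # (b*t, n_experts)
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)      # renormalise over the chosen experts
        out = torch.zeros_like(flat)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(b, t, d)

if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(2, 10, 64)                # e.g. encoded scene and ego tokens
    print(layer(tokens).shape)                     # torch.Size([2, 10, 64])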
|
|
08:40-08:45, Paper WeAT23.3 | |
Low-Rank Adaptation-Based All-Weather Removal for Autonomous Navigation |
|
Rajagopalan, Sudarshan | Johns Hopkins University |
Patel, Vishal | Johns Hopkins University |
Keywords: Computer Vision for Automation, Autonomous Vehicle Navigation
Abstract: All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them to often struggle with out-of-distribution (OoD) samples or unseen degradations which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation. Project page: https://sudraj2002.github.io/loraapage/.
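For reference, a minimal sketch of Low-Rank Adaptation on a single frozen linear layer: the pre-trained weight stays fixed and only the low-rank update B @ A is trained, so the effective weight is W + (alpha / r) * B @ A. This is generic LoRA, not the LoRA-Align variant described above, and the layer sizes are placeholders.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # freeze the pre-trained weights
            p.requires_grad_(False)
        self.scaling = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at zero

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

if __name__ == "__main__":
    pretrained = nn.Linear(256, 256)               # stands in for a restoration-model layer
    adapted = LoRALinear(pretrained, r=8)
    trainable = [n for n, p in adapted.named_parameters() if p.requires_grad]
    print(trainable)                               # only ['A', 'B'] are updated
    print(adapted(torch.randn(4, 256)).shape)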
|
|
08:45-08:50, Paper WeAT23.4 | |
Stands on Shoulders of Giants: Learning to Lift 2D Detection to 3D with Geometry-Driven Objectives |
|
Chen, Jhih Rong | National Yang Ming Chiao Tung University |
Chang, Che Yuan | National Yang Ming Chiao Tung University |
Tseng, Szu Han | Elan Microelectronics Corporation |
Huang, Chih Sheng | Elan Microelectronics Corporation |
Chen, Yong-Sheng | National Yang Ming Chiao Tung University |
Chiu, Wei-Chen | National Chiao Tung University |
Keywords: Computer Vision for Automation, AI-Based Methods, Vision-Based Navigation
Abstract: 3D detection of vehicles is an essential component for autonomous driving applications. Nevertheless, collecting the supervised training data for learning 3D vehicle detectors would be costly (e.g. utilization of expensive LiDAR sensors) and labor-intensive (for human annotation). In comparison to 3D detection, 2D object detection has achieved a well-developed status, boasting stable and robust performance with widespread application in numerous fields, thanks to the large scale (i.e. amount of samples) of existing training datasets of 2D object detection. Hence, in our work, we propose to realize 3D detection via leveraging the robustness of 2D detectors and developing a network that lifts 2D detections to 3D. With the flexibility of building upon various backbone models (e.g. the models which take image regions detected by a 2D detector as inputs to predict their corresponding 3D bounding boxes, or the existing monocular 3D detection models which have the intermediate output of 2D bounding boxes), we propose several geometry-driven objectives, including projection consistency loss, geometry depth loss, and opposite bin loss, to improve the training upon 2D-to-3D lifting. Our extensive experimental results demonstrate that our proposed geometry-driven objectives not only contribute to the superior results of 3D detection but also provide better generalizability across datasets.
|
|
08:50-08:55, Paper WeAT23.5 | |
LidarDM: Generative LiDAR Simulation in a Generated World |
|
Zyrianov, Vlas | University of Illinois at Urbana-Champaign |
Che, Henry | University of Illinois, Urbana-Champaign |
Liu, Zhijian | Massachusetts Institute of Technology |
Wang, Shenlong | University of Illinois at Urbana-Champaign |
Keywords: Autonomous Vehicle Navigation, Simulation and Animation, AI-Based Methods
Abstract: We present LidarDM, a novel LiDAR generative model capable of producing realistic, layout-aware, physically plausible, and temporally coherent LiDAR videos. LidarDM stands out with two unprecedented capabilities in LiDAR generative modeling: (i) LiDAR generation guided by driving scenarios, offering significant potential for autonomous driving simulations, and (ii) 4D LiDAR point cloud generation, enabling the creation of realistic and temporally coherent sequences. At the heart of our model is a novel integrated 4D world generation framework. Specifically, we employ latent diffusion models to generate the 3D scene, combine it with dynamic actors to form the underlying 4D world, and subsequently produce realistic sensory observations within this virtual environment. Our experiments indicate that our approach outperforms competing algorithms in realism, temporal coherency, and layout consistency. We additionally show that LidarDM can be used as a generative world model simulator for training and testing perception models.
|
|
08:55-09:00, Paper WeAT23.6 | |
RenderWorld: World Model with Self-Supervised 3D Label |
|
Yan, Ziyang | University of Trento |
Dong, Wenzhen | The Chinese University of HongKong |
Shao, Yihua | University of Science and Technology Beijing |
Lu, Yuhang | ShanghaiTech University |
Liu, Haiyang | University of Science and Technology Beijing |
Liu, Jingwen | University of Science and Technology Beijing |
Wang, Haozhe | Hong Kong University of Science and Technology |
Wang, Zhe | Institute for AI Industry Research, Tsinghua University |
Wang, Yan | Tsinghua University |
Remondino, Fabio | FBK |
Ma, Yuexin | ShanghaiTech University |
Keywords: Computer Vision for Automation, Planning, Scheduling and Coordination, Object Detection, Segmentation and Categorization
Abstract: End-to-end autonomous driving with vision only is not only more cost-effective compared to LiDAR-vision fusion but also more reliable than traditional methods. To achieve an economical and robust purely visual autonomous driving system, we propose RenderWorld, a vision-only end-to-end autonomous driving framework, which generates 3D occupancy labels using a self-supervised Gaussian-based Img2Occ module, encodes the labels with an AM-VAE, and uses a world model for forecasting and planning. RenderWorld employs Gaussian Splatting to represent 3D scenes and render 2D images, which greatly improves segmentation accuracy and reduces GPU memory consumption compared with NeRF-based methods. By applying the AM-VAE to encode air and non-air separately, RenderWorld achieves a more fine-grained representation of scene elements, leading to state-of-the-art performance in both 4D occupancy forecasting and motion planning from an autoregressive world model.
|
|
WeBT2 |
301 |
SLAM 4 |
Regular Session |
Chair: Rosen, David | Northeastern University |
Co-Chair: De Cristóforis, Pablo | University of Buenos Aires |
|
09:55-10:00, Paper WeBT2.1 | |
Introspective Loop Closure for SLAM with 4D Imaging Radar |
|
Hilger, Maximilian | Technical University of Munich |
Kubelka, Vladimir | Örebro University |
Adolfsson, Daniel | Örebro University |
Becker, Ralf | Company Bosch Rexroth |
Andreasson, Henrik | Örebro University |
Lilienthal, Achim J. | Orebro University |
Keywords: SLAM, Mapping, Localization
Abstract: Simultaneous Localization and Mapping (SLAM) allows mobile robots to navigate without external positioning systems or pre-existing maps. Radar is emerging as a valuable sensing tool, especially in vision-obstructed environments, as it is less affected by particles than lidars or cameras. Modern 4D imaging radars provide three-dimensional geometric information and relative velocity measurements, but they bring challenges such as a small field of view and sparse, noisy point clouds. Detecting loop closures in SLAM is critical for reducing trajectory drift and maintaining map accuracy. However, the directional nature of 4D radar data makes identifying loop closures, especially from reverse viewpoints, difficult due to limited scan overlap. This article explores using 4D radar for loop closure in SLAM, focusing on similar and opposing viewpoints. We generate submaps for a denser environment representation and use introspective measures to reject false detections in feature-degenerate environments. Our experiments show accurate loop closure detection in geometrically diverse settings for both similar and opposing viewpoints, improving trajectory estimation by up to 82% in ATE and rejecting false positives in self-similar environments.
|
|
10:00-10:05, Paper WeBT2.2 | |
Range-Based 6-DoF Monte Carlo SLAM with Gradient-Guided Particle Filter on GPU |
|
Nakao, Takumi | Nagoya University |
Koide, Kenji | National Institute of Advanced Industrial Science and Technology |
Takanose, Aoki | National Institute of Advanced Industrial Science and Technology |
Oishi, Shuji | National Institute of Advanced Industrial Science and Technology |
Yokozuka, Masashi | Nat. Inst. of Advanced Industrial Science and Technology |
Date, Hisashi | University of Tsukuba |
Keywords: SLAM, Mapping, Range Sensing
Abstract: This paper presents range-based 6-DoF Monte Carlo SLAM with a gradient-guided particle update strategy. While non-parametric state estimation methods, such as particle filters, are robust in situations with high ambiguity, they are known to be unsuitable for high-dimensional problems due to the curse of dimensionality. To address this issue, we propose a particle update strategy that improves the sampling efficiency by using the gradient information of the likelihood function to guide particles toward its local maxima. Additionally, we introduce a keyframe-based map representation that represents the global map as a set of past frames (i.e., keyframes) to mitigate memory consumption. The keyframe poses for each particle are corrected using a simple loop closure method to maintain trajectory consistency. The combination of gradient information and keyframe-based map representation significantly enhances sampling efficiency and reduces memory usage compared to traditional RBPF approaches. To process a large number of particles (e.g., 100,000) in real time, the proposed framework is designed to fully exploit GPU parallel processing. Experimental results demonstrate that the proposed method exhibits extreme robustness to state ambiguity and can even deal with kidnapping situations, such as when the sensor moves to different floors via an elevator, with minimal heuristics.
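A toy 2D sketch of the gradient-guided particle update idea, under stated assumptions: after a diffusion step, each particle takes a few small steps up the gradient of the observation log-likelihood before importance weighting and resampling. The Gaussian likelihood and step sizes are illustrative; the paper applies the idea to 6-DoF range-based SLAM on a GPU.

import numpy as np

rng = np.random.default_rng(0)
TRUE_POSE = np.array([2.0, -1.0])
OBS_STD = 0.3

def log_likelihood(p):
    return -0.5 * np.sum((p - TRUE_POSE) ** 2, axis=-1) / OBS_STD ** 2

def grad_log_likelihood(p):
    return -(p - TRUE_POSE) / OBS_STD ** 2

def gradient_guided_update(particles, guide_steps=3, step=0.02, noise=0.05):
    # motion / diffusion update
    particles = particles + rng.normal(scale=noise, size=particles.shape)
    # guide particles toward local maxima of the likelihood
    for _ in range(guide_steps):
        particles = particles + step * grad_log_likelihood(particles)
    # importance weights and multinomial resampling
    ll = log_likelihood(particles)
    w = np.exp(ll - ll.max())
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

if __name__ == "__main__":
    particles = rng.uniform(-5, 5, size=(5000, 2))
    for _ in range(10):
        particles = gradient_guided_update(particles)
    print("estimate:", particles.mean(axis=0).round(2), "true:", TRUE_POSE)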
|
|
10:05-10:10, Paper WeBT2.3 | |
Distributed Certifiably Correct Range-Aided SLAM |
|
Thoms, Alexander | University of California Los Angeles |
Papalia, Alan | Massachusetts Institute of Technology |
Velasquez, Jared | University of California, Los Angeles |
Rosen, David | Northeastern University |
Narasimhan, Sriram | University of California, Los Angeles |
Keywords: Multi-Robot SLAM, Range Sensing
Abstract: Reliable simultaneous localization and mapping (SLAM) algorithms are necessary for safety-critical autonomous navigation. In the communication-constrained multi-agent setting, navigation systems increasingly use point-to-point range sensors as they afford measurements with low bandwidth requirements and known data association. The state estimation problem for these systems takes the form of range-aided (RA) SLAM. However, distributed algorithms for solving the RA-SLAM problem lack formal guarantees on the quality of the returned estimate. To this end, we present the first distributed algorithm for RA-SLAM that can efficiently recover certifiably globally optimal solutions. Our algorithm, distributed certifiably correct RA-SLAM (DCORA), achieves this via the Riemannian Staircase method, where computational procedures developed for distributed certifiably correct pose graph optimization are generalized to the RA-SLAM problem. We demonstrate DCORA's efficacy on real-world multi-agent datasets by achieving absolute trajectory errors comparable to those of a state-of-the-art centralized certifiably correct RA-SLAM algorithm. Additionally, we perform a parametric study on the structure of the RA-SLAM problem using synthetic data, revealing how common parameters affect DCORA's performance.
|
|
10:10-10:15, Paper WeBT2.4 | |
CoVoxSLAM: GPU Accelerated Globally Consistent Dense SLAM |
|
Hoss, Emiliano | University of Buenos Aires |
De Cristóforis, Pablo | University of Buenos Aires |
Keywords: SLAM, Mapping, Embedded Systems for Robotic and Automation
Abstract: A dense SLAM system is essential for mobile robots, as it provides localization and allows navigation, path planning, obstacle avoidance, and decision making in unstructured environments. Due to increasing computational demands, the use of GPUs in dense SLAM is expanding. In this work, we present coVoxSLAM, a novel GPU-accelerated volumetric SLAM system that takes full advantage of the parallel processing power of the GPU to build globally consistent maps even in large-scale environments. It was deployed on different platforms (discrete and embedded GPUs) and compared with the state of the art. The results obtained using public datasets show that coVoxSLAM delivers a significant improvement in execution times while maintaining accurate localization. The system is available open source on GitHub: https://github.com/lrse-uba/coVoxSLAM.
|
|
10:15-10:20, Paper WeBT2.5 | |
Radar4VoxMap: Accurate Odometry from Blurred Radar Observations |
|
Seok, Jiwon | Hanyang University |
Kim, Soyeong | Hanyang University |
Jo, Jaeyoung | Konkuk University, Smart Vehicle Engineering |
Lee, Jaehwan | Hanyang University |
Minseo, Jung | Hanyang |
Jo, Kichun | Hanyang University |
Keywords: SLAM, Mapping, Range Sensing
Abstract: Compared to conventional 3D radar, the 4D imaging radar provides additional height data and finer resolution measurements. Moreover, compared to LiDAR sensors, 4D imaging radar is more cost-effective and offers enhanced durability against challenging weather conditions. Despite these advantages, radar-based localization systems face several challenges, including limited resolution, leading to scattered object recognition and less precise localization. Additionally, existing methods that form submaps from filtered results can accumulate errors, leading to blurred submaps and reducing the accuracy of SLAM and odometry. To address these challenges, this paper introduces Radar4VoxMap, a novel approach designed to enhance radar-only odometry. The method includes an RCS-weighted voxel distribution map that improves registration accuracy. Furthermore, fixed-lag graph optimization is used to optimize both the submap and the pose, effectively reducing cumulative errors. The proposed method has shown strong performance on open datasets. The code is available at: https://github.com/ailab-hanyang/Radar4VoxMap.
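As a rough illustration (not the authors' implementation), an RCS-weighted voxel distribution map can be built by accumulating a weighted mean and covariance per voxel, with the radar cross-section acting as the weight; the function name, voxel size, and weighting details below are assumptions:

import numpy as np
from collections import defaultdict

def rcs_weighted_voxel_map(points, rcs, voxel_size=1.0):
    """Accumulate an RCS-weighted mean and covariance per voxel.
    points: (N, 3) radar points; rcs: (N,) radar cross-section values used as weights."""
    voxels = defaultdict(lambda: {"w": 0.0, "mean": np.zeros(3), "M2": np.zeros((3, 3))})
    for p, w in zip(points, rcs):
        key = tuple(np.floor(p / voxel_size).astype(int))
        v = voxels[key]
        v["w"] += w
        delta = p - v["mean"]
        v["mean"] += (w / v["w"]) * delta                # weighted incremental mean
        v["M2"] += w * np.outer(delta, p - v["mean"])    # weighted scatter (West's update)
    return {k: (v["mean"], v["M2"] / v["w"]) for k, v in voxels.items()}

Registering against per-voxel Gaussian statistics rather than raw, scattered returns is what lets strong (high-RCS) reflectors dominate the alignment.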
|
|
10:20-10:25, Paper WeBT2.6 | |
GenZ-ICP: Generalizable and Degeneracy-Robust LiDAR Odometry Using an Adaptive Weighting |
|
Lee, Daehan | Pohang University of Science and Technology |
Lim, Hyungtae | Massachusetts Institute of Technology |
Han, Soohee | Pohang University of Science and Technology ( POSTECH ) |
Keywords: SLAM, Localization, Mapping
Abstract: Light detection and ranging (LiDAR)-based odometry has been widely utilized for pose estimation due to its use of high-accuracy range measurements and immunity to ambient light conditions. However, the performance of LiDAR odometry varies depending on the environment and deteriorates in degenerative environments such as long corridors. This issue stems from the dependence on a single error metric, which has different strengths and weaknesses depending on the geometrical characteristics of the surroundings. To address these problems, this study proposes a novel iterative closest point (ICP) method called GenZ-ICP. We revisit both point-to-plane and point-to-point error metrics and propose a method that leverages their strengths in a complementary manner. Moreover, adaptability to diverse environments is enhanced by utilizing an adaptive weight that is adjusted based on the geometrical characteristics of the surroundings. As demonstrated in our experimental evaluation, the proposed GenZ-ICP exhibits high adaptability to various environments and resilience to optimization degradation in corridor-like degenerative scenarios by preventing ill-posed problems during the optimization process. Our code is available at https://github.com/cocel-postech/genz-icp.
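As a sketch of how the two error metrics can be combined in a complementary manner, the residuals can be stacked under a single blending weight; how alpha is adapted from the scene's geometric characteristics is the paper's contribution and is not reproduced here, so alpha is taken as given:

import numpy as np

def blended_icp_residuals(src, tgt, tgt_normals, alpha):
    """Blend point-to-plane and point-to-point residuals with weight alpha in [0, 1].
    src, tgt: (N, 3) corresponding points; tgt_normals: (N, 3) unit normals at tgt."""
    diff = src - tgt
    r_plane = np.sum(diff * tgt_normals, axis=1)   # (N,) signed distances to the tangent planes
    r_point = diff.ravel()                         # (3N,) raw point-to-point differences
    # Stacking so that a least-squares solver minimizes
    #   alpha * ||r_plane||^2 + (1 - alpha) * ||r_point||^2
    return np.concatenate([np.sqrt(alpha) * r_plane, np.sqrt(1.0 - alpha) * r_point])

In a corridor, point-to-plane residuals alone leave the motion along the corridor axis poorly constrained, which is exactly the kind of ill-posedness an adaptive weight is meant to prevent.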
|
|
10:25-10:30, Paper WeBT2.7 | |
Free-Init: Scan-Free, Motion-Free, and Correspondence-Free Initialization for Doppler LiDAR-Inertial Systems |
|
Zhao, Mingle | University of Macau |
Wang, Jiahao | University of Macau |
Gao, Tianxiao | University of Macao |
Xu, Chengzhong | University of Macau |
Kong, Hui | University of Macau |
Keywords: SLAM, Localization, Mapping
Abstract: Robust initialization is crucial for online systems. In the letter, a high-frequency and resilient initialization framework is designed for LiDAR-inertial systems, leveraging both inertial sensors and Doppler LiDAR. The innovative FMCW Doppler LiDAR opens up a novel avenue for robotic sensing by capturing not only point range but also Doppler velocity via the intrinsic Doppler effect. By fusing point-wise Doppler velocity with inertial measurements under non-inertial kinematics, the proposed framework, Free-Init, eliminates reliance on motion undistortion of LiDAR scans, excitation motions, and map correspondences during the initialization phase. Free-Init is also plug-and-play compatible with typical LiDAR-inertial systems and is versatile to handle a wide range of initial motions when the system starts, including stationary, dynamic, and even violent motions. The embedded Doppler-inertial velocimeter ensures fast convergence and high-frequency performance, delivering outputs exceeding 10 kHz. Comprehensive experiments on diverse platforms and across myriad motion scenes validate the framework's effectiveness. The results demonstrate the superior performance of Free-Init, highlighting the necessity of fast, resilient, and dynamic initialization for online systems.
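The relation that makes a scan-free, motion-free, correspondence-free start possible is that, for a static point, the measured Doppler (radial) velocity is the projection of the sensor's velocity onto the ray direction. A least-squares sketch under that assumption (the sign convention and function name are mine):

import numpy as np

def sensor_velocity_from_doppler(points, doppler):
    """Estimate the sensor's linear velocity from per-point Doppler returns.
    points: (N, 3) points in the sensor frame; doppler: (N,) radial velocities in m/s."""
    dirs = points / np.linalg.norm(points, axis=1, keepdims=True)   # unit ray directions d_i
    # Static-world model: measured radial velocity = -d_i . v_sensor
    A, b = -dirs, doppler
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v   # (3,) velocity; dynamic points violate the model and show up as large residuals

Because every scan, or even a fraction of one, yields such a velocity estimate, it can be fused with the IMU at a very high rate, which is what the embedded Doppler-inertial velocimeter exploits.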
|
|
WeBT3 |
303 |
Mechanism Design 2 |
Regular Session |
Chair: Yim, Justin K. | University of Illinois Urbana-Champaign |
Co-Chair: Santin, Marco | Aalen University |
|
09:55-10:00, Paper WeBT3.1 | |
Development of a 2-DOF Singularity-Free Spherical Parallel Remote Center of Motion Mechanism with Extensive Range of Motion |
|
Liu, Chun | National Taiwan University |
Lin, Pei-Chun | National Taiwan University |
Keywords: Actuation and Joint Mechanisms, Mechanism Design
Abstract: In this paper, we report the development of an innovative two-degrees-of-freedom (2-DOF) spherical parallel remote center of motion mechanism (SPRCMM), which can offer a wide range of movement in both DOFs without encountering singularities. To facilitate the design process, the paper briefly reviews the existing spherical joints, including serial and parallel structures with and without the remote center of motion (RCM). Aiming at combining the advantages of these existing spherical joints, this paper proposes a novel design that utilizes the parallelogram mechanism to form a parallel RCM mechanism without using universal or spherical joints. Forward and inverse kinematics were constructed using the product of the exponentials. Moreover, space and closed Jacobians were derived, accompanied by manipulability in the available workspace for the mechanism. The prototype of the 2-DOF SPRCMM was built and experimentally evaluated. The experimental results confirm that the singularity-free motion of the two DOFs of the mechanism in a wide range is feasible, and the root mean squared errors in the trajectory tracking of the mechanism in most states were less than 10% of the motion range.
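The abstract mentions forward and inverse kinematics built from the product of exponentials; a generic PoE forward-kinematics sketch is given below (the specific screw axes of the 2-DOF SPRCMM, which would intersect at the remote center of motion, are not reproduced):

import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])

def exp_twist(S, theta):
    """Matrix exponential of a unit twist S = (w, v) over angle theta (Rodrigues' formula)."""
    w, v = S[:3], S[3:]
    W = hat(w)
    R = np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * W @ W
    G = np.eye(3) * theta + (1.0 - np.cos(theta)) * W + (theta - np.sin(theta)) * W @ W
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = G @ v
    return T

def fk_poe(screws, thetas, M):
    """Product-of-exponentials forward kinematics: T = exp(S1*t1) ... exp(Sn*tn) @ M."""
    T = np.eye(4)
    for S, th in zip(screws, thetas):
        T = T @ exp_twist(np.asarray(S, dtype=float), th)
    return T @ M

For a spherical RCM mechanism, the equivalent output motion is rotation about axes that intersect at the remote center, so the tool frame pivots about that fixed point by construction.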
|
|
10:00-10:05, Paper WeBT3.2 | |
Highly Dynamic Physical Interaction for Robotics: Design and Control of an Active Remote Center of Compliance |
|
Friedrich, Christian | Karlsruhe University of Applied Sciences |
Frank, Patrick | Hochschule Karlsruhe - University of Applied Sciences (HKA) |
Santin, Marco | Aalen University |
Haag, Carl Matthias | Hochschule Aalen |
Keywords: Mechanism Design, Force Control, Industrial Robots
Abstract: Robot interaction control is often limited to low dynamics or low flexibility, depending on whether an active or passive approach is chosen. In this work, we introduce a hybrid control scheme that combines the advantages of active and passive interaction control. To accomplish this, we propose the design of a novel Active Remote Center of Compliance (ARCC), which is based on a passive and an active element that can be used to directly control the interaction forces. We introduce surrogate models for a dynamic comparison against purely robot-based interaction schemes. In a comparative validation, ARCC drastically improves the interaction dynamics, leading to an increase in the motion bandwidth of up to 31 times. We further introduce our control approach as well as its integration in the robot controller. Finally, we analyze ARCC on different industrial benchmarks such as peg-in-hole, top-hat rail assembly, and contour-following problems and compare it against the state of the art to highlight its dynamics and flexibility. The proposed system is especially suited if the application requires a low cycle time combined with sensitive manipulation.
|
|
10:05-10:10, Paper WeBT3.3 | |
Pinto: A Latched Spring Actuated Robot for Jumping and Perching |
|
Xu, Christopher | University of Illinois Urbana-Champaign |
Yan, Huihan | University of Illinois at Urbana-Champaign |
Yim, Justin K. | University of Illinois Urbana-Champaign |
Keywords: Mechanism Design, Legged Robots, Compliant Joints and Mechanisms
Abstract: Arboreal environments challenge current robots but are deftly traversed by many familiar animals such as squirrels. We present a small, 450 g robot "Pinto" developed for tree-jumping, a behavior seen in squirrels but rarely in legged robots: jumping from the ground onto a vertical tree trunk. We develop a powerful and lightweight latched series-elastic actuator using a twisted string and carbon fiber springs. We consider the effects of scaling down conventional quadrupeds and experimentally show how storing energy in a parallel-elastic fashion using a latch increases jump energy compared to series-elastic or springless strategies. By switching between series and parallel-elastic modes with our latched 5-bar leg mechanism, Pinto executes energetic jumps as well as maintains continuous control during shorter bounding motions. We also develop sprung 2-DoF arms equipped with spined grippers to grasp tree bark for high-speed perching following a jump.
|
|
10:10-10:15, Paper WeBT3.4 | |
D3-ARM: High-Dynamic, Dexterous and Fully Decoupled Cable-Driven Robotic Arm |
|
Luo, Hong | Tsinghua University |
Xu, Jianle | Tsinghua University |
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Liang, Huayue | Tsinghua University |
Chen, Yanbo | Tsinghua University |
Xia, Chongkun | Sun Yat-Sen University |
Wang, Xueqian | Center for Artificial Intelligence and Robotics, Graduate School |
Keywords: Tendon/Wire Mechanism, Mechanism Design, Robot Safety
Abstract: Cable transmission enables the motors of a robotic arm to operate lightweight and low-inertia joints remotely in various environments, but it also creates issues with motion coupling and cable routing that can reduce the arm's control precision and performance. In this paper, we present a novel low-friction motion decoupling mechanism to align the cables and efficiently transmit the motor's power. By arranging these mechanisms at the joints, we fabricate a fully decoupled and lightweight cable-driven robotic arm called D3-Arm, with all the electrical components placed at the base. Its 776 mm-long moving part boasts six degrees of freedom (DOF) and weighs only 1.6 kg. To address the issue of cable slack, a cable-pretension mechanism is integrated to enhance the stability of long-distance cable transmission. Through a series of comprehensive tests, D3-Arm demonstrated a 1.29 mm average positioning error and a 2 kg payload capacity, proving the practicality of the proposed decoupling mechanisms in cable-driven robotic arms.
|
|
10:15-10:20, Paper WeBT3.5 | |
Design of an Articulated Modular Caterpillar Using Spherical Linkages |
|
O'Connor, Sam | University of Notre Dame |
Plecnik, Mark | University of Notre Dame |
Keywords: Mechanism Design, Kinematics, Multi-Robot Systems
Abstract: Articulation between body segments of small insects and animals is a three degree-of-freedom (DOF) motion. Implementing this kind of motion in a compact robot is usually not tractable due to limitations in small actuator technologies. In this work, we concede full 3-DOF control and instead select a one degree-of-freedom curve in SO(3) to articulate segments of a caterpillar robot. The curve is approximated with a spherical four-bar, which is synthesized through optimal rigid body guidance. We specify the desired SO(3) motion using discrete task positions, then solve for candidate mechanisms by computing all roots of the stationary conditions using numerical homotopy continuation. A caterpillar robot prototype demonstrates the utility of this approach. This synthesis procedure is also used to design prolegs for the caterpillar robot. Each segment contains two DC motors and a shape memory alloy, which is used for latching and unlatching between segments. The caterpillar robot is capable of walking, steering, object manipulation, body articulation, and climbing.
|
|
10:20-10:25, Paper WeBT3.6 | |
Informed Repurposing of Quadruped Legs for New Tasks |
|
Chen, Fuchen | Arizona State University |
Aukes, Daniel | Arizona State University |
Keywords: Mechanism Design, Legged Robots, Compliant Joints and Mechanisms
Abstract: Redesigning and remanufacturing robots are infeasible for resource-constrained environments like space or undersea. This work thus studies how to evaluate and repurpose existing, complementary, quadruped legs for new tasks. We implement this approach on 15 robot designs generated from combining six pre-selected leg designs. The performance maps for force-based locomotion tasks like pulling, pushing, and carrying objects are constructed via a learned policy that works across all designs and adapts to the limits of each. Performance predictions agree well with real-world validation results. The robot can locomote at 0.5 body lengths per second while exerting a force that is almost 60% of its weight.
|
|
10:25-10:30, Paper WeBT3.7 | |
Generative-AI-Driven Jumping Robot Design Using Diffusion Models |
|
Kim, Byungchul | MIT |
Wang, Tsun-Hsuan | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Keywords: Mechanism Design, Methods and Tools for Robot System Design, Deep Learning Methods
Abstract: Recent advances in foundation models are significantly expanding the capabilities of AI models. As part of this progress, this paper introduces a robot design framework that uses a diffusion model approach for generating 3D mesh structures. Specifically, we focus on generating directly fabricable robot structures that require no post-processing guided by human-imposed design constraints. Our approach can find the optimal design of the robot by optimizing or composing embedding vectors of the model. The efficacy of the framework is validated through an application to design, fabricate, and evaluate a jumping robot. Our solution is an optimized jumping robot with a 41% increase in jump height compared to the state-of-the-art design. Additionally, when the robot is augmented with an optimized foot, it can land reliably with a success ratio of 88% in contrast to the 4% success ratio of the base robot.
|
|
WeBT4 |
304 |
Sensor Fusion 1 |
Regular Session |
Co-Chair: Forbes, James Richard | McGill University |
|
09:55-10:00, Paper WeBT4.1 | |
A Hessian for Gaussian Mixture Likelihoods in Nonlinear Least Squares |
|
Korotkine, Vassili | McGill University |
Cohen, Mitchell | McGill University |
Forbes, James Richard | McGill University |
Keywords: Sensor Fusion, Probabilistic Inference, SLAM
Abstract: This paper proposes a novel Hessian approximation for Maximum a Posteriori estimation problems in robotics involving Gaussian mixture likelihoods. Previous approaches manipulate the Gaussian mixture likelihood into a form that allows the problem to be represented as a nonlinear least squares (NLS) problem. The resulting Hessian approximation used within NLS solvers from these approaches neglects certain nonlinearities. The proposed Hessian approximation is derived by setting the Hessians of the Gaussian mixture component errors to zero, which is the same starting point as for the Gauss-Newton Hessian approximation for NLS, and using the chain rule to account for additional nonlinearities. The proposed Hessian approximation results in improved convergence speed and uncertainty characterization for simulated experiments, and similar performance to the state of the art on real-world experiments. A method to maintain compatibility with existing solvers, such as ceres, is also presented. Accompanying software and supplementary material can be found at https://github.com/decargroup/hessian_sum_mixtures.
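For orientation, here is a sketch of the quantities involved, in simplified notation with whitened component errors e_k(x) and mixture weights w_k; this is a derivation under the abstract's stated assumption (component-error Hessians set to zero), not necessarily the paper's final expression:

\[
e(\mathbf{x}) = -\log \sum_{k=1}^{K} w_k \exp\!\Big(-\tfrac{1}{2}\,\|\mathbf{e}_k(\mathbf{x})\|^2\Big),
\qquad
\alpha_k = \frac{w_k \exp\big(-\tfrac{1}{2}\|\mathbf{e}_k\|^2\big)}{\sum_j w_j \exp\big(-\tfrac{1}{2}\|\mathbf{e}_j\|^2\big)},
\]
\[
\nabla e = \sum_k \alpha_k\, \mathbf{J}_k^{\top}\mathbf{e}_k,
\qquad
\nabla^2 e \approx \sum_k \alpha_k\, \mathbf{J}_k^{\top}\mathbf{J}_k
- \sum_k \alpha_k\, \big(\mathbf{J}_k^{\top}\mathbf{e}_k\big)\big(\mathbf{J}_k^{\top}\mathbf{e}_k\big)^{\top}
+ (\nabla e)(\nabla e)^{\top},
\qquad
\mathbf{J}_k = \frac{\partial \mathbf{e}_k}{\partial \mathbf{x}}.
\]

With a single component (K = 1) this collapses to the familiar Gauss-Newton approximation J^T J, which is the expected consistency check; the additional chain-rule terms capture the nonlinearity of the log-sum-exp that the standard NLS reformulations leave out.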
|
|
10:00-10:05, Paper WeBT4.2 | |
Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios |
|
Xu, Jialei | Harbin Institute of Technology |
Li, Rui | Northwestern Polytechnical University |
Cheng, Kai | USTC |
Jiang, Junjun | Harbin Institute of Technology |
Liu, Xianming | Harbin Institute of Technology |
Keywords: Deep Learning for Visual Perception, Sensor Fusion, RGB-D Perception
Abstract: Monocular depth estimation from RGB images plays a pivotal role in 3D vision. However, its accuracy can deteriorate in challenging environments such as nighttime or adverse weather conditions. While long-wave infrared cameras offer stable imaging in such challenging conditions, they are inherently low-resolution, lacking the rich texture and semantics delivered by the RGB image. Current methods focus solely on a single modality due to the difficulty of identifying and integrating faithful depth cues from both sources. To address these issues, this paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework. Concretely, we independently compute the coarse depth maps with separate networks by fully utilizing the individual depth cues from each modality. Because advantageous depth cues are spread across both modalities, we propose a novel confidence loss steering a confidence predictor network to yield a confidence map specifying latent potential depth areas. With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner. Harnessing the proposed pipeline, our method demonstrates robust depth estimation in a variety of difficult scenarios. Experimental results on the challenging MS^2 and ViViD++ datasets demonstrate the effectiveness and robustness of our method.
|
|
10:05-10:10, Paper WeBT4.3 | |
Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection |
|
Yang, Yiran | University of Chinese Academy of Sciences |
Gao, Xu | Baidu |
Wang, Tong | Baidu |
Hao, Xin | Baidu |
Shi, Yifeng | BAIDU.INC |
Tan, Xiao | Baidu |
Ye, Xiaoqing | Baidu Inc |
Keywords: Computer Vision for Automation, Sensor Fusion
Abstract: Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems. However, these sensors often exhibit heterogeneous natures, resulting in distributional modality gaps that present significant challenges for fusion. To address this, a robust fusion technique is crucial, particularly for enhancing 3D object detection. In this paper, we introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations to enhance the fusion process. Specifically, we propose a triphase domain aligning module. This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences. Additionally, we explore improved representation acquisition methods for dynamic fusion, which include modal interaction and specialty enhancement. Finally, an adaptive learning technique merges the semantic and geometric information for dynamic instance optimization. Extensive experiments on the nuScenes dataset demonstrate competitive performance with state-of-the-art approaches. Our code will be released in the future.
|
|
10:10-10:15, Paper WeBT4.4 | |
Bridging Spectral-Wise and Multi-Spectral Depth Estimation Via Geometry-Guided Contrastive Learning |
|
Shin, Ukcheol | CMU(Carnegie Mellon University) |
Lee, Kyunghyun | KAIST |
Oh, Jean | Carnegie Mellon University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Deploying depth estimation networks in the real world requires high-level robustness against various adverse conditions to ensure safe and reliable autonomy. For this purpose, many autonomous vehicles employ multi-modal sensor systems, including an RGB camera, NIR camera, thermal camera, LiDAR, or Radar. They mainly adopt two strategies to use multiple sensors: modality-wise and multi-modal fused inference. The former method is flexible but memory-inefficient, unreliable, and vulnerable. Multi-modal fusion can provide high-level reliability, yet it needs a specialized architecture. In this paper, we propose an effective solution, named align-and-fuse strategy, for the depth estimation from multi-spectral images. In the align stage, we align embedding spaces between multiple spectrum bands to learn shareable representation across multi-spectral images by minimizing contrastive loss of global and spatially aligned local features with geometry cues. After that, in the fuse stage, we train an attachable feature fusion module that can selectively aggregate the multi-spectral features for reliable and robust prediction results. Based on the proposed method, a single-depth network can achieve both spectral-invariant and multi-spectral fused depth estimation while preserving reliability, memory efficiency, and flexibility.
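The "align" stage is contrastive; below is a generic symmetric InfoNCE sketch between embeddings from two spectral bands (the paper's loss additionally uses geometry-guided, spatially aligned local features, which this toy version omits):

import torch
import torch.nn.functional as F

def infonce_align(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of corresponding embeddings.
    feat_a, feat_b: (B, C) features from, e.g., RGB and thermal views of the same scenes."""
    z1 = F.normalize(feat_a, dim=1)
    z2 = F.normalize(feat_b, dim=1)
    logits = z1 @ z2.t() / temperature                      # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)    # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Once the embedding spaces agree, a single depth head can consume either spectrum, which is what allows the later "fuse" stage to be an attachable module rather than a separate network.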
|
|
10:15-10:20, Paper WeBT4.5 | |
VAIR: Visuo-Acoustic Implicit Representations for Low-Cost, Multi-Modal Transparent Surface Reconstruction in Indoor Scenes |
|
Venkatramanan Sethuraman, Advaith | University of Michigan |
Bagoren, Onur | University of Michigan |
Seetharaman, Harikrishnan | University of Michigan - Ann Arbor |
Richardson, Dalton | University of Michigan |
Taylor, Joseph | University of Michigan, Ann Arbor |
Skinner, Katherine | University of Michigan |
Keywords: RGB-D Perception, Deep Learning for Visual Perception, Sensor Fusion
Abstract: Mobile robots operating indoors must be prepared to navigate challenging scenes that contain transparent surfaces. This paper proposes a novel method for the fusion of acoustic and visual sensing modalities through implicit neural representations to enable dense reconstruction of transparent surfaces in indoor scenes. We propose a novel model that leverages generative latent optimization to learn an implicit representation of indoor scenes consisting of transparent surfaces. We demonstrate that we can query the implicit representation to enable volumetric rendering in image space or 3D geometry reconstruction (point clouds or mesh) with transparent surface prediction. We evaluate our method’s effectiveness qualitatively and quantitatively on a new dataset collected using a custom, low-cost sensing platform featuring RGB-D cameras and ultrasonic sensors. Our method exhibits significant improvement over the state of the art for transparent surface reconstruction.
|
|
10:20-10:25, Paper WeBT4.6 | |
CDMFusion: RGB-T Image Fusion Based on Conditional Diffusion Models Via Few Denoising Steps in Open Environments |
|
Yang, Luojie | Beijing Institute of Technology |
Yu, Meng | Beijing Institute of Technology |
Fang, Lijin | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Keywords: Sensor Fusion, Deep Learning for Visual Perception
Abstract: Multi-modal fusion can improve perceptual robustness and accuracy by fully utilizing multi-source sensor data. Current RGB-T fusion methods still falter with adverse illumination and weather. Recent advances in generative methods have shown the ability to enhance and restore visible images in adverse conditions. However, the fusion of RGB-T based on generative methods has not been studied in depth, due to limited attention given to the degradation of multi-modal features under challenging circumstances. Motivated by this observation, we propose CDMFusion, a three-branch conditional diffusion model that achieves fusion with dynamically enhancing multi-modal features and suppressing high-frequency interference. Specifically, we achieve feature-preserving fusion through three branches and establish a dynamic gating prediction module to adjust the enhancement of multi-modal features adaptively. In addition, considering the high time cost of existing diffusion models for generating fused images, we propose a skip patrol mechanism to achieve accelerated high-quality generation with no need for additional training. Experiments demonstrate our method achieves excellent performance in multiple datasets. The code and datasets are available at https://github.com/yangluojie/CDMFusion.
|
|
10:25-10:30, Paper WeBT4.7 | |
UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection |
|
Zhao, Haocheng | Xi'an Jiaotong-Liverpool University |
Guan, Runwei | University of Liverpool |
Wu, Taoyu | Xi'an Jiaotong-Liverpool University |
Man, Ka Lok | Xi'an Jiaotong-Liverpool University |
Yu, Limin | Xi'an Jiaotong-Liverpool University |
Yue, Yutao | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Sensor Fusion, Object Detection, Segmentation and Categorization, AI-Based Methods
Abstract: 4D millimeter-wave (MMW) radar, which provides both height information and denser point cloud data than 3D MMW radar, has become increasingly popular in 3D object detection. In recent years, radar-vision fusion models have demonstrated performance close to that of LiDAR-based models, offering advantages in terms of lower hardware costs and better resilience in extreme conditions. However, many radar-vision fusion models treat radar as a sparse LiDAR, underutilizing radar-specific information. Additionally, these multi-modal networks are often sensitive to the failure of a single modality, particularly vision. To address these challenges, we propose the Radar Depth Lift-Splat-Shoot (RDL) module, which integrates radar-specific data into the depth prediction process, enhancing the quality of visual Bird’s-Eye View (BEV) features. We further introduce a Unified Feature Fusion (UFF) approach that extracts BEV features across different modalities using a shared module. To assess the robustness of multi-modal models, we develop a novel Failure Test (FT) ablation experiment, which simulates vision modality failure by injecting Gaussian noise. We conduct extensive experiments on the View-of-Delft (VoD) and TJ4D datasets. The results demonstrate that our proposed Unified BEVFusion (UniBEVFusion) network significantly outperforms state-of-the-art models on the TJ4D dataset, with improvements of 3.96% in 3D and 4.17% in BEV object detection accuracy.
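A minimal version of the Failure Test idea, corrupting the camera input with Gaussian noise; the tensor layout, severity, and failure probability below are assumptions, not the paper's exact protocol:

import torch

def failure_test_images(images, severity=1.0, p_fail=1.0):
    """Simulate vision-modality failure by injecting Gaussian noise into the camera input.
    images: (B, N_cams, 3, H, W) tensor normalized to [0, 1]."""
    fail = (torch.rand(images.shape[0]) < p_fail).float().view(-1, 1, 1, 1, 1)
    noisy = images + severity * torch.randn_like(images)
    return (fail * noisy + (1.0 - fail) * images).clamp(0.0, 1.0)

Running the detector with such corrupted images while leaving the radar stream intact isolates how much the fused model silently depends on vision.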
|
|
WeBT5 |
305 |
Aerial Robots: Mechanics and Control 1 |
Regular Session |
Chair: Yamamoto, Ko | University of Tokyo |
Co-Chair: Saldaña, David | Lehigh University |
|
09:55-10:00, Paper WeBT5.1 | |
A Generalized Thrust Estimation and Control Approach for Multirotors Micro Aerial Vehicles |
|
Santos, Davi Henrique dos | Universidade Federal Da Paraíba |
Saska, Martin | Czech Technical University in Prague |
Nascimento, Tiago | Universidade Federal Da Paraiba |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Motion Control
Abstract: This paper addresses the problem of thrust estimation and control for the rotors of small-sized multirotor Uncrewed Aerial Vehicles (UAVs). Accurate control of the thrust generated by each rotor during flight is one of the main challenges for robust control of quadrotors. The most common approach is to approximate the mapping of rotor speed to thrust with a simple quadratic model. This model is known to fail under non-hovering flight conditions, introducing errors into the control pipeline. One of the approaches to modeling the aerodynamics around the propellers is the Blade Element Momentum Theory (BEMT). Here, we propose a novel BEMT-based closed-loop thrust estimator and control to eliminate the laborious calibration step of finding several aerodynamic coefficients. We aim to reuse known values as a baseline and fit the thrust estimate to values closest to the real ones with a simple test bench experiment, resulting in a single scaling value. A feedforward PID thrust control was implemented for each rotor, and the methods were validated by outdoor experiments with two multirotor UAV platforms: 250 mm and 500 mm. A statistical analysis of the results showed that the thrust estimation and control provided better robustness under aerodynamically varying flight conditions compared to the quadratic model.
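For context, the "simple quadratic model" the abstract refers to is the static map from rotor speed to thrust; the coefficient below is purely illustrative:

def quadratic_thrust(omega, k_f=1.3e-6):
    """Common baseline rotor model: thrust T = k_f * omega^2 (omega in rad/s, T in newtons).
    k_f is an illustrative value; this static map is what breaks down away from hover,
    motivating the BEMT-based closed-loop estimator described above."""
    return k_f * omega ** 2

A BEMT-based estimator instead accounts for blade geometry and inflow, so forward flight and climb or descent change the predicted thrust rather than silently invalidating the model.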
|
|
10:00-10:05, Paper WeBT5.2 | |
Trajectory Planning and Control for Differentially Flat Fixed-Wing Aerial Systems |
|
Morando, Luca | New York University |
Salunkhe, Sanket Ankush | Colorado School of Mines |
Bobbili, Nishanth | New York University |
Mao, Jeffrey | New York University |
Masci, Luca | New York University |
Hung, Nguyen | Instituto Superior Técnico |
De Souza Jr., Cristino | Technology Innovation Institute |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Efficient real-time trajectory planning and control for fixed-wing unmanned aerial vehicles is challenging due to their non-holonomic nature, complex dynamics, and the additional uncertainties introduced by unknown aerodynamic effects. In this paper, we present a fast and efficient real-time trajectory planning and control approach for fixed-wing unmanned aerial vehicles, leveraging the differential flatness property of fixed-wing aircraft in coordinated flight conditions to generate dynamically feasible trajectories. The approach provides the ability to continuously replan trajectories, which we show is useful to dynamically account for the curvature constraint as the aircraft advances along its path. Extensive simulations and real-world experiments validate our approach, showcasing its effectiveness in generating trajectories across various flight conditions, including wind disturbances.
|
|
10:05-10:10, Paper WeBT5.3 | |
Safe Quadrotor Navigation Using Composite Control Barrier Functions |
|
Harms, Marvin Chayton | NTNU |
Jacquet, Martin | NTNU |
Alexis, Kostas | NTNU - Norwegian University of Science and Technology |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy
Abstract: This paper introduces a safety filter to ensure collision avoidance for multirotor aerial robots. The proposed formalism leverages a single Composite Control Barrier Function from all position constraints acting on a third-order nonlinear representation of the robot's dynamics. We analyze the recursive feasibility of the safety filter under the composite constraint and demonstrate that the infeasible set is negligible. The proposed method allows computational scalability against thousands of constraints and, thus, complex scenes with numerous obstacles. We experimentally demonstrate its ability to guarantee the safety of a quadrotor with an onboard LiDAR, operating in both indoor and outdoor cluttered environments against both naive and adversarial nominal policies.
|
|
10:10-10:15, Paper WeBT5.4 | |
The Spinning Blimp: Design and Control of a Novel Minimalist Aerial Vehicle Leveraging Rotational Dynamics and Locomotion |
|
Santens, Leonardo | Lehigh University |
S. D'Antonio, Diego | Lehigh University |
Hou, Shuhang | Lehigh University |
Saldaña, David | Lehigh University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: This paper presents the Spinning Blimp, a novel lighter-than-air (LTA) aerial vehicle designed for low-energy stable flight. Using an oblate spheroid helium balloon for buoyancy, the vehicle achieves minimal energy consumption while maintaining prolonged airborne states. The unique and low-cost design employs a passively arranged wing coupled with a propeller to induce a spinning behavior, providing inherent pendulum-like stabilization. We propose a control strategy that takes advantage of the continuous revolving nature of the spinning blimp to control translational motion. The cost-effectiveness of the vehicle makes it highly suitable for a variety of applications, such as patrolling, localization, air and turbulence monitoring, and domestic surveillance. Experimental evaluations affirm the design's efficacy and underscore its potential as a versatile and economically viable solution for aerial applications.
|
|
10:15-10:20, Paper WeBT5.5 | |
One Net to Rule Them All: Domain Randomization in Quadcopter Racing across Different Platforms |
|
Ferede, Robin | TU Delft |
Blaha, Till Martin | Delft University of Technology |
Lucassen, Erin | Delft University of Technology |
De Wagter, Christophe | Delft University of Technology |
de Croon, Guido | TU Delft |
Keywords: Aerial Systems: Mechanics and Control, Reinforcement Learning, Robust/Adaptive Control
Abstract: In high-speed quadcopter racing, finding a single controller that works well across different platforms remains challenging. This work presents the first neural network controller for drone racing that generalizes across physically distinct quadcopters. We demonstrate that a single network, trained with domain randomization, can robustly control various types of quadcopters. The network relies solely on the current state to directly compute motor commands. The effectiveness of this generalized controller is validated through real-world tests on two substantially different crafts (3-inch and 5-inch race quadcopters). We further compare the performance of this generalized controller with controllers specifically trained for the 3-inch and 5-inch drones, using their identified model parameters with varying levels of domain randomization (0%, 10%, 20%, 30%). While the generalized controller shows slightly slower speeds compared to the fine-tuned models, it excels in adaptability across different platforms. Our results show that sim-to-real transfer fails without randomization, while increasing randomization improves robustness but reduces speed. Despite this trade-off, our findings highlight the potential of domain randomization for generalizing controllers, paving the way for universal AI controllers that can adapt to any platform.
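Domain randomization here amounts to drawing a fresh physical model for each training episode; the parameter ranges below are illustrative assumptions spanning roughly the 3-inch to 5-inch class, not the paper's values:

import numpy as np

def sample_quadcopter_params(rng):
    """Sample one randomized quadcopter model for a training rollout (illustrative ranges)."""
    return {
        "mass_kg":       rng.uniform(0.10, 0.80),
        "arm_length_m":  rng.uniform(0.04, 0.12),
        "thrust_coeff":  rng.uniform(0.5e-6, 3.0e-6),
        "drag_coeff":    rng.uniform(0.5e-8, 3.0e-8),
        "motor_tau_s":   rng.uniform(0.01, 0.06),   # first-order motor time constant
        "voltage_scale": rng.uniform(0.85, 1.00),   # battery sag on available thrust
    }

rng = np.random.default_rng(42)
episode_models = [sample_quadcopter_params(rng) for _ in range(1000)]   # one model per rollout

The width of these ranges is the knob the 0-30% study sweeps: wider ranges force the policy to rely less on any single platform's dynamics, improving transfer at the cost of peak speed.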
|
|
10:20-10:25, Paper WeBT5.6 | |
Modeling and Control of Aerial Robot SERPENT: A Soft Structure Incorporated Multirotor Aerial Robot Capable of In-Flight Flexible Deformation |
|
Itahara, Shotaro | The University of Tokyo |
Nishio, Takuzumi | The University of Tokyo |
Ishigaki, Taiki | The University of Tokyo |
Sugihara, Junichiro | The University of Tokyo |
Zhao, Moju | The University of Tokyo |
Yamamoto, Ko | University of Tokyo |
Keywords: Aerial Systems: Mechanics and Control
Abstract: This paper introduces a novel method for controlling multirotor aerial robots connected by passive flexible elements. Despite the growing popularity of multirotor aerial robots, their real-world applications remain limited due to difficulties adapting to complex environments. Soft robotics, due to their inherent flexibility, offer a potential solution, though research on integrating flexible elements into aerial robots is still in the early stages. In this study, we propose control methods for a system where multiple aerial robots are interconnected with passive flexible elements. These robotic systems enhance adaptability, enabling tasks like object manipulation. We model the flexible parts using the piecewise constant strain (PCS) model, which allows for model-based closed-loop control and stabilizes various configurations of the system. Through simulations and experiments, we validated that the proposed method achieves both stable flight and flexible deformation. Notably, we succeeded in maintaining stable flight, which was not possible with traditional methods, and demonstrated both positional controllability and the ability of the flexible parts to bend dynamically during flight.
|
|
10:25-10:30, Paper WeBT5.7 | |
Embodying Compliant Touch on Drones for Aerial Tactile Navigation |
|
Bredenbeck, Anton | TU Delft |
Della Santina, Cosimo | TU Delft |
Hamaza, Salua | TU Delft |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy, Compliant Joints and Mechanisms
Abstract: Aerial robots are a well-established solution for environmental surveying, exploration, and inspection, thanks to their superior maneuverability and agility. Nowadays, the algorithms that provide these capabilities rely on GNSS and Vision, which are obstructed in some environments of interest, e.g., indoors and underground or in smoke and dust. In similar conditions, animals rely on the sense of touch and compliant responses to interactions embodied in the body morphology. This way, they can navigate safely using tactile cues by feeling the environment surrounding them. In this work, we take inspiration from the natural example and propose an approach that allows a quadrotor to navigate using tactile information from the environment. We propose to endow a conventional quadrotor with a novel robotic finger that embodies compliance and sensing capabilities. We complete this design with a navigation approach that generates new waypoints based on the robotic finger's contact information to follow the unknown environment. The overall system's evaluation shows successful, repeatable results in 36 flight experiments with various relative angles between the drone and a planar surface.
|
|
WeBT6 |
307 |
Vision-Based Navigation 2 |
Regular Session |
|
09:55-10:00, Paper WeBT6.1 | |
Adaptive Learning for Hybrid Visual Odometry |
|
Liu, Ziming | INRIA |
Malis, Ezio | Inria |
Martinet, Philippe | INRIA |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Transportation
Abstract: Hybrid visual odometry methods achieve state-of-the-art performance by fusing both data-based deep learning networks and rule-based localization approaches. However, these methods also suffer from the deep learning domain gap problem, which leads to an accuracy drop of the hybrid visual odometry approach when a new type of data is considered. This paper is the first to explore a practical solution to this problem. Indeed, the deep learning network in hybrid visual odometry predicts the stereo disparity within a fixed search space. However, the disparity distribution is unbalanced in stereo images acquired in different environments. We propose an adaptive network structure to overcome this problem. Secondly, the rule-based localization module achieves robust performance by optimizing the camera pose online on test data, which motivates us to introduce a test-time training method for improving the data-based part of the hybrid visual odometry.
|
|
10:00-10:05, Paper WeBT6.2 | |
SOLVR: Submap Oriented LiDAR-Visual Re-Localisation |
|
Knights, Joshua Barton | Queensland University of Technology |
Barbas Laina, Sebastián | TU Munich |
Moghadam, Peyman | CSIRO |
Leutenegger, Stefan | Technical University of Munich |
Keywords: Deep Learning Methods, Deep Learning for Visual Perception, Recognition
Abstract: This paper proposes SOLVR, a unified pipeline for learning based LiDAR-Visual re-localisation which performs place recognition and 6-DoF registration across sensor modalities. We propose a strategy to align the input sensor modalities by leveraging stereo image streams to produce metric depth predictions with pose information, followed by fusing multiple scene views from a local window using a probabilistic occupancy framework to expand the limited field-of-view of the camera. Additionally, SOLVR adopts a flexible definition of what constitutes positive examples for different training losses, allowing us to simultaneously optimise place recognition and registration performance. Furthermore, we replace RANSAC with a registration function that weights a simple least-squares fitting with the estimated inlier likelihood of sparse keypoint correspondences, improving performance in scenarios with a low inlier ratio between the query and retrieved place.
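The registration function that replaces RANSAC can be read as a weighted least-squares rigid fit; a standard weighted Kabsch-style sketch is shown below (the per-correspondence weights are the estimated inlier likelihoods, assumed given here):

import numpy as np

def weighted_rigid_fit(src, tgt, w):
    """Find R, t minimizing sum_i w_i ||R @ src_i + t - tgt_i||^2 (weighted Kabsch/Umeyama).
    src, tgt: (N, 3) corresponding keypoints; w: (N,) non-negative inlier likelihoods."""
    w = w / w.sum()
    mu_s = (w[:, None] * src).sum(axis=0)
    mu_t = (w[:, None] * tgt).sum(axis=0)
    H = (w[:, None] * (src - mu_s)).T @ (tgt - mu_t)       # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])     # guard against a reflection solution
    R = Vt.T @ D @ U.T
    t = mu_t - R @ mu_s
    return R, t

Because every correspondence contributes in proportion to its predicted inlier likelihood, low-inlier-ratio cases degrade gracefully instead of hinging on RANSAC finding a clean minimal sample.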
|
|
10:05-10:10, Paper WeBT6.3 | |
SSF: Sparse Long-Range Scene Flow for Autonomous Driving |
|
Khoche, Ajinkya | KTH Royal Institute of Technology Stockholm, SCANIA CV AB |
Zhang, Qingwen | KTH Royal Institute of Technology |
Pereira Sanchez, Laura | Stanford University |
Asefaw, Aron | Royal Institute of Technology |
Sharif Mansouri, Sina | Scania |
Jensfelt, Patric | KTH - Royal Institute of Technology |
Keywords: Deep Learning Methods, Computer Vision for Transportation, Object Detection, Segmentation and Categorization
Abstract: Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance in the long-range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to long-range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme, that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our source code is open-sourced at https://github.com/KTH-RPL/SSF.
|
|
10:10-10:15, Paper WeBT6.4 | |
BoxMap: Efficient Structural Mapping and Navigation |
|
Wang, Zili | Boston University |
Allum, Christopher | Boston University |
Andersson, Sean | Boston University |
Tron, Roberto | Boston University |
Keywords: Deep Learning Methods, Autonomous Agents, Task and Motion Planning
Abstract: While humans can successfully navigate using abstractions, ignoring details that are irrelevant to the task at hand, most of the existing approaches in robotics require detailed environment representations which consume a significant amount of sensing, computing, and storage; these issues become particularly important in resource-constrained settings with limited power budgets. Deep learning methods can learn from prior experience to abstract knowledge from novel environments, and use it to more efficiently execute tasks such as frontier exploration, object search, or scene understanding. We propose BoxMap, a Detection-Transformer-based architecture that takes advantage of the structure of the sensed partial environment to update a topological graph of the environment as a set of semantic entities (rooms and doors) and their relations (connectivity). The predictions from low-level measurements can be leveraged to achieve high-level goals with lower computational costs than methods based on detailed representations. As an example application, we consider a robot equipped with a 2-D laser scanner tasked with exploring a residential building. Our BoxMap representation scales quadratically with the number of rooms (with a small constant), resulting in significant savings over a full geometric map. Moreover, our high-level topological representation results in 30.9% shorter trajectories in the exploration task with respect to a standard method. Code is available at: bit.ly/3F6w2Yl.
|
|
10:15-10:20, Paper WeBT6.5 | |
UncAD: Towards Safe End-To-End Autonomous Driving Via Online Map Uncertainty |
|
Yang, Pengxuan | University of Chinese Academy of Sciences (UCAS) |
Zheng, Yupeng | School of Artificial Intelligence, University of Chinese Academy |
Zhang, Qichao | Institute of Automation, Chinese Academy of Sciences |
Zhu, Kefei | UCAS |
Xing, Zebin | UCAS |
Lin, Qiao | EACON Technology Co., Ltd |
Liu, Yun-Fu | Eacon |
Su, Zhiguo | EACON Technology Co., Ltd |
Zhao, Dongbin | Chinese Academy of Sciences |
Keywords: Vision-Based Navigation, Integrated Planning and Learning, Computer Vision for Transportation
Abstract: End-to-end autonomous driving aims to produce planning trajectories from raw sensors directly. Currently, most approaches integrate perception, prediction, and planning modules into a fully differentiable network, promising great scalability. However, these methods typically rely on deterministic modeling of online maps in the perception module for guiding or constraining vehicle planning, which may incorporate erroneous perception information and further compromise planning safety. To address this issue, we delve into the importance of online map uncertainty for enhancing autonomous driving safety and propose a novel paradigm named UncAD. Specifically, UncAD first estimates the uncertainty of the online map in the perception module. It then leverages the uncertainty to guide motion prediction and planning modules to produce multi-modal trajectories. Finally, to achieve safer autonomous driving, UncAD proposes an uncertainty-collision-aware planning selection strategy according to the online map uncertainty to evaluate and select the best trajectory. In this study, we incorporate UncAD into various state-of-the-art (SOTA) end-to-end methods. Experiments on the nuScenes dataset show that integrating UncAD, with only a 1.9% increase in parameters, can reduce collision rates by up to 26% and drivable area conflict rate by up to 42%. Codes, pre-trained models, and demo videos can be accessed at https://github.com/pengxuanyang/UncAD.
|
|
10:20-10:25, Paper WeBT6.6 | |
Multi-Floor Zero-Shot Object Navigation Policy |
|
Zhang, Lingfeng | The Hong Kong University of Science and Technology (Guangzhou) |
Wang, Hao | Hong Kong University of Science and Technology(Guang Zhou) |
Xiao, Erjia | The Hong Kong University of Science and Technology (Guangzhou) |
Zhang, Xinyao | Hong Kong University of Science and Technology (GUANGZHOU) |
Zhang, Qiang | The Hong Kong University of Science and Technology (Guangzhou) |
Jiang, Zixuan | HKUST(GZ) |
Xu, Renjing | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Vision-Based Navigation, Embodied Cognitive Science, Visual Learning
Abstract: Object navigation in multi-floor environments presents a formidable challenge in robotics, requiring sophisticated spatial reasoning and adaptive exploration strategies. Traditional approaches have primarily focused on single-floor scenarios, overlooking the complexities introduced by multi-floor structures. To address these challenges, we first propose a Multi-floor Navigation Policy (MFNP) and implement it in Zero-Shot object navigation tasks. Our framework comprises three key components: (i) Multi-floor Navigation Policy, which enables an agent to explore across multiple floors; (ii) Multi-modal Large Language Models (MLLMs) for reasoning in the navigation process; and (iii) Inter-Floor Navigation, ensuring efficient floor transitions. We evaluate MFNP on the Habitat-Matterport 3D (HM3D) and Matterport 3D (MP3D) datasets, both include multi-floor scenes. Our experiment results demonstrate that MFNP significantly outperforms all the existing methods in Zero-Shot object navigation, achieving higher success rates and improved exploration efficiency. Ablation studies further highlight the effectiveness of each component in addressing the unique challenges of multi-floor navigation. Meanwhile, we conducted real-world experiments to evaluate the feasibility of our policy. Upon deployment of MFNP, the Unitree quadruped robot demonstrated successful multi-floor navigation and found the target object in a completely unseen environment. By introducing MFNP, we offer a new paradigm for tackling complex, multi-floor environments in object navigation tasks, opening avenues for future research in visual-based navigation in realistic, multi-floor settings.
|
|
10:25-10:30, Paper WeBT6.7 | |
Fed-EC: Bandwidth-Efficient Clustering-Based Federated Learning for Autonomous Visual Robot Navigation |
|
Gummadi, Shreya | University of Illinois at Urbana-Champaign |
Valverde Gasparino, Mateus | University of Illinois at Urbana-Champaign |
Vasisht, Deepak | University of Illinois at Urbana Champaign |
Chowdhary, Girish | University of Illinois at Urbana Champaign |
Keywords: Distributed Robot Systems, Vision-Based Navigation, Field Robots
Abstract: Centralized learning requires data to be aggregated at a central server, which poses significant challenges in terms of data privacy and bandwidth consumption. Federated learning presents a compelling alternative, however, vanilla Federated Learning methods deployed in robotics aim to learn a single global model across robots that works ideally for all. But in practice one model may not be well suited for robots deployed in various environments. This paper proposes Federated-EmbedCluster (Fed-EC), a clustering-based federated learning framework that is deployed with vision based autonomous robot navigation in diverse outdoor environments. The framework addresses the key federated learning challenge of deteriorating model performance of a single global model due to the presence of non-IID data across real-world robots. Extensive real-world experiments validate that Fed-EC reduces the communication size by 23x for each robot while matching the performance of centralized learning for goal-oriented navigation and outperforms local learning. Fed-EC can transfer previously learnt models to new robots that join the cluster.
|
|
WeBT7 |
309 |
Perception 1 |
Regular Session |
Co-Chair: Cho, Younggun | Inha University |
|
09:55-10:00, Paper WeBT7.1 | |
Using a Distance Sensor to Detect Deviations in a Planar Surface |
|
Sifferman, Carter | University of Wisconsin-Madison |
Sun, William | UW-Madison |
Gupta, Mohit | University of Wisconsin-Madison |
Gleicher, Michael | University of Wisconsin - Madison |
Keywords: Range Sensing, Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: We investigate methods for determining if a planar surface contains geometric deviations (e.g. protrusions, objects, divots, or cliffs) using only an instantaneous measurement from a miniature optical time-of-flight sensor. The key to our method is to utilize the entirety of information encoded in raw time-of-flight data captured by off-the-shelf distance sensors. We provide an analysis of the problem in which we identify the key ambiguity between geometry and surface photometrics. To overcome this challenging ambiguity, we fit a Gaussian mixture model to a small dataset of planar surface measurements. This model implicitly captures the expected geometry and distribution of photometrics of the planar surface and is used to identify measurements that are likely to contain deviations. We characterize our method on a variety of surfaces and planar deviations across a range of scenarios. We find that our method utilizing raw time-of-flight data outperforms baselines which use only derived distance estimates. We build an example application in which our method enables mobile robot obstacle and cliff avoidance over a wide field-of-view.
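A bare-bones sketch of the fit-then-threshold idea using scikit-learn; the feature layout, file name, component count, and threshold are assumptions, and the actual method operates on the sensor's raw time-of-flight data rather than derived features:

import numpy as np
from sklearn.mixture import GaussianMixture

planar_features = np.load("planar_tof_features.npy")   # (N, D) measurements of known-flat surfaces
gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(planar_features)
threshold = np.percentile(gmm.score_samples(planar_features), 1)   # 1st-percentile log-likelihood

def contains_deviation(measurement):
    """Flag a single (D,) measurement whose likelihood under the planar model is too low."""
    return gmm.score_samples(measurement[None, :])[0] < threshold

The mixture jointly models geometry and surface photometrics, which is how the method sidesteps the ambiguity between the two that a single derived distance estimate cannot resolve.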
|
|
10:00-10:05, Paper WeBT7.2 | |
Narrowing Your FOV with SOLiD: Spatially Organized and Lightweight Global Descriptor for FOV-Constrained LiDAR Place Recognition |
|
Kim, Hogyun | Inha University |
Choi, Jiwon | Inha University |
Sim, Taehu | Inha University |
Kim, Giseop | DGIST (Daegu Gyeongbuk Institute of Science and Technology) |
Cho, Younggun | Inha University |
Keywords: Localization, SLAM, Range Sensing
Abstract: We often encounter limited FOV situations due to various factors such as sensor fusion or sensor mounting in real-world robot navigation. However, the limited FOV interrupts the generation of descriptors and adversely impacts place recognition. As a result, it is difficult to correct accumulated drift errors in a consistent map using LiDAR-based place recognition with a limited FOV. Thus, in this paper, we propose a robust LiDAR-based place recognition method for handling narrow FOV scenarios. The proposed method establishes spatial organization based on the range-elevation bin and azimuth-elevation bin to represent places. In addition, we achieve a robust place description through reweighting based on vertical direction information. Based on these representations, our method enables addressing rotational changes and determining the initial heading. Additionally, we designed a lightweight and fast approach for the robot's onboard autonomy. For rigorous validation, the proposed method was tested across various LiDAR place recognition scenarios (i.e., single-session, multi-session, and multi-robot scenarios). To the best of our knowledge, we report the first method to cope with the restricted FOV. Our place description and SLAM codes will be released. Also, the supplementary materials of our descriptor are available at https://sites.google.com/view/lidar-solid.
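A simplified sketch of the spatial organization described above, histogramming a scan into range-elevation and azimuth-elevation bins; the bin counts are assumptions, and the vertical-direction reweighting and heading search are omitted:

import numpy as np

def solid_like_descriptor(points, n_range=40, n_azim=60, n_elev=16, max_range=80.0):
    """Build range-elevation and azimuth-elevation occupancy histograms from an (N, 3) scan."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points, axis=1)
    azim = np.arctan2(y, x)                                              # [-pi, pi)
    elev = np.arcsin(np.clip(z / np.maximum(rng, 1e-6), -1.0, 1.0))
    r_idx = np.clip((rng / max_range * n_range).astype(int), 0, n_range - 1)
    a_idx = ((azim + np.pi) / (2.0 * np.pi) * n_azim).astype(int) % n_azim
    e_idx = np.clip(((elev + np.pi / 2.0) / np.pi * n_elev).astype(int), 0, n_elev - 1)
    re_hist = np.zeros((n_range, n_elev))
    ae_hist = np.zeros((n_azim, n_elev))
    np.add.at(re_hist, (r_idx, e_idx), 1)
    np.add.at(ae_hist, (a_idx, e_idx), 1)
    return re_hist, ae_hist

A narrow FOV simply leaves some azimuth columns empty; comparing circularly shifted azimuth-elevation histograms is one way such a descriptor can recover relative heading, which is the rotation handling the abstract alludes to.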
|
|
10:05-10:10, Paper WeBT7.3 | |
Towards Survivability in Complex Motion Scenarios: RGB-Event Object Tracking Via Historical Trajectory Prompting |
|
Xia, Wenhao | Dalian University of Technology |
Zhu, Jiawen | Dalian University of Technology |
He, You | Tsinghua University |
Qi, Jinqing | Dalian University of Technology |
Huang, Zihao | Dalian University of Technology |
Jia, Xu | Dalian University of Technology |
Keywords: Visual Tracking, Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: Event data have recently emerged as a valuable complement to object tracking, providing cues with dense temporal resolution and high dynamic range. Existing RGB-Event trackers often struggle with complex motion trajectories, where RGB features alone cannot provide sufficient discriminative power. To address this issue, we propose an innovative RGB-Event tracking framework, termed EventTPT, which leverages prompts embedded in historical trajectories. Specifically, EventTPT aggregates the trajectories of multiple adjacent frames into a single event image using temporally weighted aggregation and subsequently feeds it into the tracker as a visual prompt for current-frame localization. A cross-modal adaptive fusion module is further designed for cases of photometric inconsistency. In addition, we present a novel and challenging RGB-Event tracking benchmark, EventUAV, containing sequences with high motion complexity.
|
|
10:10-10:15, Paper WeBT7.4 | |
Spatially Constrained and Deeply Learned Bilateral Structural Intensity-Depth Registration Autonomously Navigates a Flexible Endoscope |
|
Fang, Hao | Xiamen University |
Wu, Ming | Xiamen University |
Fan, Wenkang | Xiamen University |
Luo, Guangcheng | Zhongshan Hospital Xiamen University |
Luo, Xiongbiao | Xiamen University |
Keywords: Vision-Based Navigation, Visual Tracking
Abstract: Endoscope tracking is commonly utilized to provide surgeons with in-body camera poses and visual fields during invasive procedures. The fundamental aspect of endoscopic navigation lies in precisely and continuously tracing the position and orientation of the endoscope within monocular endoscopic video sequences in a preoperative data space. This work proposes a new spatially constrained and deeply learned bilateral structural intensity-depth 2D-3D registration framework for autonomously navigating a flexible endoscope. Concretely, a novel bilateral structural intensity-depth similarity function is defined to tackle the deficiency of using image intensity alone, while a cross-domain monocular depth estimation model trained on virtual image data is used to accurately predict real-image dense depth. Additionally, a spatial constraint is introduced to precisely reinitialize an optimizer to reduce accumulative tracking errors. We validate our method on clinical data, with the experimental results showing that our method significantly outperforms current vision-based navigation methods. In particular, the average position and orientation errors were reduced from (4.59 mm, 9.22°) to (1.65 mm, 4.67°).
|
|
10:15-10:20, Paper WeBT7.5 | |
E2B: A Single Modality Point-Based Tracker with Event Cameras |
|
Ren, Hongwei | The Hong Kong University of Science and Technology (Guangzhou) |
Li, Zhuo | Peking University |
Tuerhong, Aiersi | Chongqing University |
Liu, Haobo | The University of Electronic Science and Technology of China |
Liang, Fei | Huawei Technologies Company Ltd |
Feng, Yongxiang | Huawei Technologies Company Ltd |
Wang, Wenhui | Tsinghua University |
Wang, Yaoyuan | Huawei |
Zhang, Ziyang | Huawei, China |
He, Weihua | Tsinghua University |
Cheng, Bojun | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Visual Tracking, Representation Learning, Deep Learning Methods
Abstract: High-speed object tracking holds significant relevance across robotic domains, such as drones and autonomous driving. Compared to conventional cameras, event cameras are equipped with the ability to capture object motion information at exceptionally high temporal resolution with relatively low power consumption, and they remain immune to motion-blurring effects. Regrettably, many existing methods adopt a frame-based approach by stacking events into Event Frames, which overlooks the sparsity and high temporal resolution of events. This approach relies on pre-trained backbones and reaches a performance plateau while demanding unrealistically large networks and high power consumption, rendering it impractical for real-time applications in battery-constrained scenarios. In this paper, we propose an efficient and effective single-modality tracker using a Point Cloud representation named E2B (Event to Box). By directly handling the raw output of event cameras without data-format transformation, E2B leverages events' coordinate guidance to accurately map Event Cloud features to 2D bounding boxes. Moreover, E2B incorporates a pyramid structure into the multi-stage feature extraction architecture to effectively track objects across diverse scales. In the experiments, E2B performs outstandingly on two large-scale and one synthetic event-based tracking datasets, covering both indoor and outdoor environments, as well as rigid and non-rigid objects.
|
|
10:20-10:25, Paper WeBT7.6 | |
F²R²: Frequency Filtering-Based Rectification Robustness Method for Stereo Matching |
|
Zhou, Haolong | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Zhu, Dongchen | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Zhang, Guanghui | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Wang, Lei | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Li, Jiamao | Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences |
Keywords: Deep Learning for Visual Perception
Abstract: Most stereo matching networks assume that the stereo images are perfectly rectified, ignoring the perturbation of extrinsic parameters due to collisions, mechanical vibrations, and thermal expansion. This leads to poor rectification robustness in real-world stereo systems. That is, even minor rectification errors can lead to failure, making stereo systems unreliable for long-term autonomous operation in complex environments. In this paper, we are the first to propose a frequency filtering-based rectification robustness (F²R²) method for stereo matching, which aims to enhance the robustness of existing stereo networks to rectification errors. Specifically, we propose a sensitive frequency filter (SFF) to remove components susceptible to rectification errors within the frequency domain. SFF achieves the filtering through the learning-based adaptive filtering mask (AFM) guided by the spatial-frequency mapping modulation mask (SFM). Moreover, we build the matching feature reconstruction module (MFRM) to recover the features lost during filtering to benefit cost aggregation. Comprehensive experiments on simulated datasets and self-collected data validate that our method can significantly enhance the rectification robustness of stereo matching networks.
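A minimal sketch of the core idea of filtering in the frequency domain is given below: a feature map is transformed with a 2D real FFT, attenuated by a learnable mask, and transformed back. The module name, mask parameterization, and sigmoid gating are assumptions, not the F²R² implementation itself.

import torch
import torch.nn as nn

class SensitiveFrequencyFilter(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        # One learnable value per real-FFT frequency bin, initialized to pass-through.
        self.mask = nn.Parameter(torch.ones(height, width // 2 + 1))

    def forward(self, feat):  # feat: (B, C, H, W)
        spec = torch.fft.rfft2(feat, norm="ortho")
        spec = spec * torch.sigmoid(self.mask)  # attenuate rectification-sensitive bins
        return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")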
|
|
10:25-10:30, Paper WeBT7.7 | |
VisTune: Auto-Tuner for UAVs Using Vision-Based Localization |
|
Humais, Muhammad Ahmed | Khalifa University |
Chehadeh, Mohamad | Khalifa University for Science and Technology |
Azzam, Rana | Khalifa University of Science and Technology |
Boiko, Igor | Khalifa University |
Zweiri, Yahya | Khalifa University |
Keywords: Vision-Based Navigation, Aerial Systems: Mechanics and Control, Aerial Systems: Perception and Autonomy
Abstract: This paper presents VisTune, a method for automatic controller tuning, specifically designed for UAVs using vision-based localization (VBL) for position control. In contrast to existing methods that involve flying the UAV manually to collect the data for system identification and tuning, our approach leverages relay-based system identification and tuning that autonomously generates stable oscillations, without the need for a stabilizing controller. The whole process concludes within a few seconds. Prior work in vision-based position control of UAVs often ignores the delay from the perception pipeline, which is quite significant and results in suboptimal tuning and poor control performance. Our approach accounts for perception delay and addresses practical issues, such as varying delays due to varying computation requirements and inevitable estimation errors, which pose challenges in applying relay-based identification and tuning. Typically, a VBL system introduces over 100 ms of delay, compared to less than 20 ms when a motion capture system is used. Moreover, we show that the perception delay identified by VisTune can be effectively used to temporally advance the feedforward acceleration signal to achieve better tracking performance. Finally, we demonstrate the robustness of the tuned controllers on a trajectory tracking task, reaching speeds up to 2.1 m/s with an RMS control error of only 0.054 m; under a wind disturbance of 5 m/s, we report an RMSE of 0.116 m. A video of experiments is available at https://youtu.be/hJoT8bn0K0o
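For readers unfamiliar with relay-based tuning, the sketch below shows a generic relay element and the describing-function estimate of the ultimate gain from the induced oscillation (in the spirit of Astrom-Hagglund relay feedback). The parameter values and this particular estimate are assumptions, not VisTune's identification procedure, which additionally accounts for perception delay.

import numpy as np

def relay_output(error, amplitude=0.3, hysteresis=0.02, prev=0.3):
    # Ideal relay with hysteresis: returns +/- amplitude.
    if error > hysteresis:
        return amplitude
    if error < -hysteresis:
        return -amplitude
    return prev  # inside the hysteresis band, keep the previous output

def ultimate_gain_and_period(osc_amplitude, osc_period, relay_amplitude=0.3):
    # Describing-function estimate of the ultimate gain Ku and period Tu.
    ku = 4.0 * relay_amplitude / (np.pi * osc_amplitude)
    return ku, osc_period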
|
|
WeBT8 |
311 |
Representation Learning 2 |
Regular Session |
|
09:55-10:00, Paper WeBT8.1 | |
GeMuCo: Generalized Multisensory Correlational Model for Body Schema Learning |
|
Kawaharazuka, Kento | The University of Tokyo |
Okada, Kei | The University of Tokyo |
Inaba, Masayuki | The University of Tokyo |
Keywords: Learning from Experience, Software Architecture for Robotic and Automation, Cognitive Control Architectures
Abstract: Humans can autonomously learn the relationship between sensation and motion in their own bodies, estimate and control their own body states, and move while continuously adapting to the current environment. On the other hand, current robots control their bodies by learning the network structure described by humans from their experiences, making certain assumptions on the relationship between sensors and actuators. In addition, the network model does not adapt to changes in the robot's body, the tools that are grasped, or the environment, and there is no unified theory, not only for control but also for state estimation, anomaly detection, simulation, and so on. In this study, we propose a Generalized Multisensory Correlational Model (GeMuCo), in which the robot itself acquires a body schema describing the correlation between sensors and actuators from its own experience, including model structures such as network input/output. The robot adapts to the current environment by updating this body schema model online, estimates and controls its body state, and even performs anomaly detection and simulation. We demonstrate the effectiveness of this method by applying it to tool-use co
|
|
10:00-10:05, Paper WeBT8.2 | |
SplatSim: Zero-Shot Sim2Real Transfer of RGB Manipulation Policies Using Gaussian Splatting |
|
Qureshi, Mohammad Nomaan | Carnegie Mellon University |
Garg, Sparsh | Carnegie Mellon University |
Yandun, Francisco | Carnegie Mellon University |
Held, David | Carnegie Mellon University |
Kantor, George | Carnegie Mellon University |
Silwal, Abhisesh | Carnegie Mellon University |
Keywords: Sensorimotor Learning, Learning from Demonstration, Data Sets for Robot Learning
Abstract: Sim2Real transfer, particularly for manipulation policies relying on RGB images, remains a critical challenge in robotics due to the significant domain shift between synthetic and real-world visual data. In this paper, we propose SplatSim, a novel framework that leverages Gaussian Splatting as the primary rendering primitive to reduce the Sim2Real gap for RGB-based manipulation policies. By replacing traditional mesh representations with Gaussian Splats in simulators, SplatSim produces highly photorealistic synthetic data while maintaining the scalability and cost-efficiency of simulation. We demonstrate the effectiveness of our framework by training manipulation policies within SplatSim and deploying them in the real world in a zero-shot manner, achieving an average success rate of 86.25%, compared to 97.5% for policies trained on real-world data.
|
|
10:05-10:10, Paper WeBT8.3 | |
SR-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models |
|
Nguyen, Viet Dung | Rochester Institute of Technology |
Yang, Zhizhuo | Rochester Institute of Technology |
Buckley, Christopher | Verses AI |
Ororbia, Alexander | Rochester Institute of Technology |
Keywords: Reinforcement Learning, Deep Learning Methods, Bioinspired Robot Learning
Abstract: Although research has produced promising results demonstrating the utility of active inference (AIF) in Markov decision processes (MDPs), there is relatively less work that builds AIF models in the context of environments and problems that take the form of partially observable Markov decision processes (POMDPs). In POMDP scenarios, the agent must infer the unobserved environmental state from raw sensory observations, e.g., pixels in an image. Additionally, less work exists in examining the most difficult form of POMDP-centered control: continuous action space POMDPs under sparse reward signals. In this work, we address issues facing the AIF modeling paradigm by introducing novel prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments. Empirically, we show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.
|
|
10:10-10:15, Paper WeBT8.4 | |
Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning |
|
Keller, Leon | TU Darmstadt |
Tanneberg, Daniel | Honda Research Institute Europe |
Peters, Jan | Technische Universität Darmstadt |
Keywords: Imitation Learning, Representation Learning, Task and Motion Planning
Abstract: Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
|
|
10:15-10:20, Paper WeBT8.5 | |
Chain-Of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models |
|
Wang, Chen | Stanford University |
Xia, Fei | Google Inc |
Yu, Wenhao | Google |
Zhang, Tingnan | Google |
Zhang, Ruohan | Stanford University |
Liu, Karen | Stanford University |
Fei-Fei, Li | Stanford University |
Tan, Jie | Google |
Liang, Jacky | Google |
Keywords: Machine Learning for Robot Control, Embodied Cognitive Science, AI-Enabled Robotics
Abstract: Learning to perform manipulation tasks from human videos is a promising approach for teaching robots. However, many manipulation tasks require changing control parameters during task execution, such as force, which visual data alone cannot capture. In this work, we leverage sensing devices such as armbands that measure human muscle activities and microphones that record sound, to capture the details in the human manipulation process, and enable robots to extract task plans and control parameters to perform the same task. To achieve this, we introduce Chain-of-Modality (CoM), a prompting strategy that enables Vision Language Models to reason about multimodal human demonstration data --- videos coupled with muscle or audio signals. By progressively integrating information from each modality, CoM refines a task plan and generates detailed control parameters, enabling robots to perform manipulation tasks based on a single multimodal human video prompt. Our experiments show that CoM delivers a threefold improvement in accuracy for extracting task plans and control parameters compared to baselines, with strong generalization to new task setups and objects in real-world robot experiments.
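The progressive integration described above can be pictured as a short prompting loop; in the sketch below, query_vlm is a placeholder for whatever vision-language-model client is available, and the prompt wording is illustrative rather than the paper's templates.

def chain_of_modality(query_vlm, video_frames, muscle_summary, audio_summary):
    # Step 1: draft a task plan from vision alone.
    plan = query_vlm(images=video_frames,
                     text="Describe the step-by-step manipulation task in this video.")
    # Step 2: refine the plan with the muscle-activity modality.
    plan = query_vlm(images=video_frames,
                     text="Refine this plan using the muscle-activity summary:\n"
                          + muscle_summary + "\n\nCurrent plan:\n" + plan)
    # Step 3: add control parameters (e.g., grasp force) with the audio modality.
    plan = query_vlm(images=video_frames,
                     text="Add control parameters using the audio summary:\n"
                          + audio_summary + "\n\nCurrent plan:\n" + plan)
    return plan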
|
|
10:20-10:25, Paper WeBT8.6 | |
VertiCoder: Self-Supervised Kinodynamic Representation Learning on Vertically Challenging Terrain |
|
Nazeri, Mohammad | George Mason University |
Datar, Aniket | George Mason University |
Pokhrel, Anuj | George Mason University |
Pan, Chenhui | George Mason University |
Warnell, Garrett | U.S. Army Research Laboratory |
Xiao, Xuesu | George Mason University |
Keywords: Representation Learning, Learning from Experience, Wheeled Robots
Abstract: We present VertiCoder, a self-supervised representation learning approach for robot mobility on vertically challenging terrain. Using the same pre-training process, VertiCoder can handle four different downstream tasks, including forward kinodynamics learning, inverse kinodynamics learning, behavior cloning, and patch reconstruction with a single representation. VertiCoder uses a TransformerEncoder to learn the local context of its surroundings by random masking and next patch reconstruction. We show that VertiCoder achieves better performance across all four different tasks compared to specialized End-to-End models with 77% fewer parameters. We also show VertiCoder's comparable performance against state-of-the-art kinodynamic modeling and planning approaches in real-world robot deployment. These results underscore the efficacy of VertiCoder in mitigating overfitting and fostering more robust generalization across diverse environmental contexts and downstream vehicle kinodynamic tasks.
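A minimal sketch of masked-reconstruction pretraining with a TransformerEncoder is shown below; the dimensions, masking ratio, and loss are assumptions rather than VertiCoder's configuration.

import torch
import torch.nn as nn

class MaskedPatchEncoder(nn.Module):
    def __init__(self, patch_dim=256, d_model=128, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.decode = nn.Linear(d_model, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, patches, mask_ratio=0.3):  # patches: (B, T, patch_dim)
        tokens = self.embed(patches)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decode(self.encoder(tokens))
        loss = ((recon - patches) ** 2)[mask].mean()  # reconstruct only masked patches
        return loss, recon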
|
|
10:25-10:30, Paper WeBT8.7 | |
Correspondence Learning between Morphologically Different Robots Via Task Demonstrations |
|
Aktas, Hakan | The University of Cambridge |
Nagai, Yukie | The University of Tokyo |
Asada, Minoru | Open and Transdisciplinary Research Initiatives, Osaka University |
Oztop, Erhan | Osaka University / Ozyegin University |
Ugur, Emre | Bogazici University |
Keywords: Developmental Robotics, Imitation Learning, Deep Learning Methods
Abstract: We observe a large variety of robots in terms of their bodies, sensors, and actuators. Given the commonalities in the skill sets, teaching each skill to each different robot independently is inefficient and not scalable when the large variety in the robotic landscape is considered. If we can learn the correspondences between the sensorimotor spaces of different robots, we can expect a skill that is learned in one robot to be more directly and easily transferred to other robots. In this paper, we propose a method to learn correspondences among two or more robots that may have different morphologies. To be specific, besides robots with similar morphologies with different degrees of freedom, we show that a fixed-base manipulator robot with joint control and a differential-drive mobile robot can be addressed within the proposed framework. To set up the correspondence among the robots considered, an initial base task is demonstrated to the robots to achieve the same goal. Then, a common latent representation is learned along with the individual robot policies for achieving the goal. After the initial learning stage, the observation of a new task execution by one robot becomes sufficient to generate a latent space representation pertaining to the other robots to achieve the same task. We verified our system in a set of experiments where the correspondence between robots is learned (1) when the robots need to follow the same paths to achieve the same task, (2) when the robots need to follow different trajectories to achieve the same task, and (3) when complexities of the required sensorimotor trajectories are different for the robots. We also provide a proof-of-concept realization of correspondence learning between a real manipulator robot and a simulated mobile robot.
|
|
WeBT9 |
312 |
Multi-Robot Exploration |
Regular Session |
Co-Chair: Pedram, Ali Reza | Georgia Institute of Technology |
|
09:55-10:00, Paper WeBT9.1 | |
Planning-Oriented Cooperative Perception among Heterogeneous Vehicles |
|
Zheng, Han | Stony Brook University |
Ye, Fan | Stony Brook University |
Yang, Yuanyuan | Stony Brook University |
Keywords: Multi-Robot Systems, Cooperating Robots, Collision Avoidance
Abstract: Vehicle-to-vehicle (V2V) based cooperative perception enhances autonomous driving by overcoming single-agent perception limitations such as occlusions, without relying on extensive infrastructure. However, most existing methods have two key limitations. They treat cooperative perception in isolation, with little consideration for downstream tasks such as planning, leading to poor coordination and inefficient planning decisions. They also assume perception model homogeneity across all vehicles, which can be impractical among vehicles from different manufacturers. To bridge such gaps, we propose Scout, an early-fusion framework for planning-oriented cooperative perception among vehicles of heterogeneous models. Specifically, we formalize a notion of Δθ-Risk Increment Distribution (RID) to capture the distribution of the risk increment that incomplete perception induces on the current trajectory plan, and define a Priority Index (PI) metric for prioritizing cooperative perception on riskier regions. We develop algorithms to estimate Δθ-RID and PI at run-time with theoretical bounds. Empirical results demonstrate that Scout surpasses state-of-the-art methods and strong baselines on challenging benchmarks, achieving higher success rates with only 3-10% of their communication volume.
|
|
10:00-10:05, Paper WeBT9.2 | |
TaskExp: Enhancing Generalization of Multi-Robot Exploration with Multi-Task Pre-Training |
|
Zhu, Shaohao | Zhejiang University |
Zhao, Yixian | Zhejiang University |
Xu, Yang | Zhejiang University |
Chen, Anjun | Zhejiang University |
Chen, Jiming | Zhejiang University |
Xu, Jinming | Zhejiang University |
Keywords: Reinforcement Learning, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: We aim to develop a general multi-agent reinforcement learning (MARL) policy that enables a group of robots to efficiently explore large-scale, unknown environments with random pose initialization. Existing MARL-based multi-robot exploration methods face challenges in reliably mapping observations to actions in large-scale scenarios and lack of zero-shot generalization to unknown environments. To this end, we propose a generic multi-task pre-training algorithm (termed TaskExp) to enhance the generalization of learning-based policies. In particular, we design a decision-related task to guide the policy to focus on valuable subspaces of the action space, improving the reliability of policy mapping. Moreover, two perception-related tasks--Location Estimation and Map Prediction--are designed to enhance the zero-shot capability of the policy by guiding it to extract general invariant features from unknown environments. With TaskExp pre-training, our policy significantly outperforms state-of-the-art planning-based methods in large-scale scenarios and demonstrates strong zero-shot performance in unseen environments. Furthermore, TaskExp can also be easily integrated to improve the existing learning-based multi-robot exploration methods.
|
|
10:05-10:10, Paper WeBT9.3 | |
WcDT: World-Centric Diffusion Transformer for Traffic Scene Generation |
|
Yang, Chen | Cardiff University |
He, Yangfan | University of Minnesota - Twin Cities |
Tian, Aaron Xuxiang | Independent Researcher |
Chen, Dong | Mississippi State University |
Wang, Jianhui | University of Electronic Science and Technology of China |
Shi, Tianyu | University of Toronto |
Heydarian, Arsalan | University of Virginia |
Liu, Pei | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Planning under Uncertainty, Deep Learning Methods
Abstract: In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed into "Agent Move Statement" and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders that are used to enhance the interaction of agents with other elements in the traffic scene. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into autonomous driving simulation systems.
|
|
10:10-10:15, Paper WeBT9.4 | |
Hybrid Decentralization for Multi-Robot Orienteering with Mothership-Passenger Systems |
|
Butler, Nathan | Oregon State University |
Hollinger, Geoffrey | Oregon State University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Marine Robotics
Abstract: We present a hybrid centralized-decentralized planning algorithm for a multi-robot system made up of a single Mothership robot and multiple Passenger robots. In this system, the Passenger robots execute tasks while the Mothership provides support. This paper addresses the challenge of planning Passenger robot movements, framing it as a Stochastic Multi-Agent Orienteering Problem (SMOP) complicated by factors like stochastic operational efforts and disruptive events. We optimize the task completion efficiency of the system by combining centralized solutions from the Mothership with local plans from Passengers to enhance system resilience. Our contributions include defining the SMOP, developing a distributed solution using decentralized Monte Carlo tree search, presenting a hybrid algorithm that integrates centralized plans into the distributed framework, and evaluating the algorithm’s performance in simulation using real-world data. Our results show that our hybrid approaches outperform fully centralized and fully distributed algorithms in dynamic and disruptive scenarios with up to 26.6% increase in task completion efficiency over baseline methods.
|
|
10:15-10:20, Paper WeBT9.5 | |
Communication-Aware Iterative Map Compression for Online Path-Planning |
|
Psomiadis, Evangelos | Georgia Institute of Technology |
Pedram, Ali Reza | Georgia Institute of Technology |
Maity, Dipankar | University of North Carolina at Charlotte |
Tsiotras, Panagiotis | Georgia Tech |
Keywords: Multi-Robot Systems, Mapping
Abstract: This paper addresses the problem of optimizing communicated information among heterogeneous, resource-aware robot teams to facilitate their navigation. In such operations, a mobile robot compresses its local map to assist another robot in reaching a target within an uncharted environment. The primary challenge lies in ensuring that the map compression step balances network load while transmitting only the most essential information for effective navigation. We propose a communication framework that sequentially selects the optimal map compression in a task-driven, communication-aware manner. It introduces a decoder capable of iterative map estimation, handling noise through Kalman filter techniques. The computational speed of our decoder allows for a larger compression template set compared to previous methods, and enables applications in more challenging environments. Specifically, our simulations demonstrate a remarkable 98% reduction in communicated information, compared to a framework that transmits the raw data, on a large Mars inclination map and an Earth map, all while maintaining similar planning costs. Furthermore, our method significantly reduces computational time compared to the state-of-the-art approach.
|
|
10:20-10:25, Paper WeBT9.6 | |
DiffCP: Ultra-Low Bit Collaborative Perception Via Diffusion Model |
|
Mao, Ruiqing | Tsinghua University |
Wu, Haotian | Imperial College London |
Jia, Yukuan | Tsinghua University |
Nan, Zhaojun | Tsinghua University |
Sun, Yuxuan | Beijing Jiaotong University |
Zhou, Sheng | Tsinghua University |
Gunduz, Deniz | Imperial College London |
Niu, Zhisheng | Tsinghua University |
Keywords: Cooperating Robots, Deep Learning for Visual Perception, Intelligent Transportation Systems
Abstract: Collaborative perception (CP) is emerging as a promising solution to the inherent limitations of stand-alone intelligence. However, current wireless communication systems are unable to support feature-level and raw-level collaborative algorithms due to their enormous bandwidth demands. In this paper, we propose DiffCP, a novel CP paradigm that utilizes a diffusion model to efficiently compress the sensing information of collaborators. By incorporating both geometric and semantic conditions into the generative model, DiffCP enables feature-level collaboration with an ultra-low communication cost, advancing the practical implementation of CP systems. This paradigm can be seamlessly integrated into existing CP algorithms to enhance a wide range of downstream tasks. Through extensive experimentation, we investigate the trade-offs between communication, computation, and performance. Numerical results demonstrate that DiffCP can significantly reduce communication costs by 14.5-fold while maintaining the same performance as the state-of-the-art algorithm.
|
|
WeBT10 |
313 |
Multi-Robot Path Planning 2 |
Regular Session |
Chair: Pierson, Alyssa | Boston University |
Co-Chair: Nam, Changjoo | Sogang University |
|
09:55-10:00, Paper WeBT10.1 | |
APF-CPP: An Artificial Potential Field Based Multi-Robot Online Coverage Path Planning Approach |
|
Wang, Zikai | Hong Kong University of Science and Technology |
Zhao, Xiaoqi | The Hong Kong University of Science and Technology |
Zhang, Jiekai | Hong Kong Applied Science and Technology Research Institute |
Yang, Nachuan | Hong Kong University of Science and Technology |
Wang, Pengyu | Hong Kong University of Science and Technology |
Tang, Jiawei | Hong Kong University of Science and Technology |
Zhang, Jiuzhou | Hong Kong University of Science and Technology |
Shi, Ling | The Hong Kong University of Science and Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Planning, Scheduling and Coordination
Abstract: Multi-robot coverage planning has gained significant attention in recent years. In this paper, we introduce a novel approach called APF-CPP (Artificial Potential Field Based Multi-Robot Online Coverage Path Planning) to enhance the collaboration of multi-robot systems to accomplish coverage tasks in unknown dynamic environments. Our approach presents a unique coverage policy that leverages the concept of the artificial potential field (APF). In contrast to conventional APF-based path planning methods that directly generate paths based on the field gradient, we utilize the APF to derive coverage policies for individual robots within a multi-robot system to achieve efficient task allocation and maintain regular coverage patterns. We have developed a policy update mechanism that allows the system to adapt its task allocation policy based on real-time conditions while minimizing the impact caused by policy changes. To better handle dead-end conditions, we also use the APF concept to improve task allocation during the dead-end recovery process. We also show that our algorithm has low computational complexity and guarantees complete coverage in finite time. We conduct extensive comparisons with other state-of-the-art (SOTA) approaches and validate our method through simulations and real-world experiments. The experimental results demonstrate the advantages of our proposed method over existing approaches and confirm the effectiveness and robustness of real-world implementation.
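As a purely illustrative sketch of deriving per-robot guidance from a potential field, the function below attracts a robot toward uncovered cells and repels it from teammates; the gains and the inverse-square form are assumptions and do not reproduce the APF-CPP coverage policy or its policy-update mechanism.

import numpy as np

def coverage_force(robot_xy, uncovered_cells, other_robots, k_att=1.0, k_rep=2.0):
    # Attraction to uncovered grid cells plus repulsion from teammates.
    robot_xy = np.asarray(robot_xy, dtype=float)
    force = np.zeros(2)
    for cell in uncovered_cells:            # attractive term
        d = np.asarray(cell, dtype=float) - robot_xy
        force += k_att * d / (np.linalg.norm(d) ** 3 + 1e-6)
    for other in other_robots:              # repulsive term
        d = robot_xy - np.asarray(other, dtype=float)
        force += k_rep * d / (np.linalg.norm(d) ** 3 + 1e-6)
    return force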
|
|
10:00-10:05, Paper WeBT10.2 | |
Exact Wavefront Propagation for Globally Optimal One-To-All Path Planning on 2D Cartesian Grids |
|
Ibrahim, Ibrahim | KU Leuven |
Gillis, Joris | KU Leuven |
Decré, Wilm | Katholieke Universiteit Leuven |
Swevers, Jan | KU Leuven |
Keywords: Motion and Path Planning, Path Planning for Multiple Mobile Robots or Agents, Computational Geometry
Abstract: This paper introduces an efficient algorithm with O(n) compute and memory complexity for globally optimal path planning on 2D Cartesian grids. Unlike existing marching methods that rely on approximate discretized solutions to the Eikonal equation, our approach achieves exact wavefront propagation by pivoting the analytic distance function based on visibility. The algorithm leverages a dynamic-programming subroutine to efficiently evaluate visibility queries. Through benchmarking against state-of-the-art any-angle path planners, we demonstrate that our method outperforms existing approaches in both speed and accuracy, particularly in cluttered environments. Notably, our method inherently provides globally optimal paths to all grid points, eliminating the need for additional gradient descent steps per path query. The same capability extends to multiple starting positions. We also provide a greedy version of our algorithm as well as an open-source C++ implementation of our solver.
|
|
10:05-10:10, Paper WeBT10.3 | |
ICBSS: An Improved Algorithm for Multi-Agent Combinatorial Path Finding |
|
Chen, Zheng | Zhejiang University |
Chen, Changlin | University of Science and Technology of China |
Ni, Yiran | Zhejiang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance, Multi-Robot Systems
Abstract: The Multi-Agent Combinatorial Path Finding (MCPF) problem is a generalized version of the Multi-Agent Path Finding (MAPF) problem, in which each agent must collectively visit multiple intermediate target locations on the way to their final destinations. The state-of-the-art approach for addressing MCPF, known as Conflict-Based Steiner Search (CBSS) [Ren et al., IEEE T-RO 2023], leverages K-best joint sequences to create multiple search trees, and employs CBS-like search to resolve collisions for each tree. Despite its optimality guarantee, CBSS is computationally burdensome due to the duplicated collision resolutions across multiple trees and the computation of the K-best joint sequences. To address these challenges, we propose a novel algorithm called Improved Conflict-Based Steiner Search (ICBSS), aiming at expediting CBSS by replacing the multiple trees with a single conflict tree (CT), which is implemented by interleaving a time-dependent traveling salesman algorithm to compute the optimal joint path for agents under the newly generated constraints in each CT vertex. Additionally, we introduce a sub-optimal variant of ICBSS, which improves computational efficiency at the expense of solution optimality. Empirical results show that ICBSS outperforms state-of-the-art MCPF algorithms on a variety of MAPF instances.
|
|
10:10-10:15, Paper WeBT10.4 | |
Escaping Local Minima: Hybrid Artificial Potential Field with Wall-Follower for Decentralized Multi-Robot Navigation |
|
Kim, Joonkyung | Sogang University |
Park, Sangjin | Sogang University |
Lee, Wonjong | Sogang University |
Kim, Woojun | Carnegie Mellon University |
Choi, Hyunga | Korea University |
Doh, Nakju | Korea University |
Nam, Changjoo | Sogang University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Reactive and Sensor-Based Planning
Abstract: We tackle the challenge of decentralized multi-robot navigation in environments with nonconvex obstacles, where complete environmental knowledge is unavailable. While reactive methods like Artificial Potential Field (APF) offer simplicity and efficiency, they suffer from local minima, causing robots to become trapped due to their lack of global environmental awareness. Other existing solutions either rely on inter-robot communication, are limited to single-robot scenarios, or struggle to navigate nonconvex obstacles effectively. Our proposed method enables collision-free navigation using only local sensor and state information without a map. By incorporating a wall-following (WF) behavior into the APF approach, our method allows robots to escape local minima, even in the presence of nonconvex and dynamic obstacles including other robots. We introduce two algorithms for switching between APF and WF: a rule-based system and an encoder network trained on expert demonstrations. Experimental results show that our approach achieves substantially higher success rates compared to state-of-the-art methods, highlighting its ability to overcome the limitations of local minima in complex environments.
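A rule-based switch of the kind described can be sketched as below: wall-following is triggered when the robot makes little progress toward its goal or the net APF force nearly vanishes. The window length, thresholds, and stuck test are assumptions, not the paper's exact rules or its learned encoder.

import numpy as np

def select_mode(apf_force, recent_positions, goal, progress_eps=0.05, window=20):
    # Return 'WF' when the robot appears trapped in a local minimum, else 'APF'.
    if len(recent_positions) >= window:
        start = np.asarray(recent_positions[-window], dtype=float)
        end = np.asarray(recent_positions[-1], dtype=float)
        goal = np.asarray(goal, dtype=float)
        progressed = np.linalg.norm(goal - start) - np.linalg.norm(goal - end)
        if progressed < progress_eps or np.linalg.norm(apf_force) < 1e-3:
            return "WF"
    return "APF"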
|
|
10:15-10:20, Paper WeBT10.5 | |
Heterogeneous Exploration and Monitoring with Online Free-Space Ellipsoid Graphs |
|
Brodt, Brennan | Boston University |
Pierson, Alyssa | Boston University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents, Cooperating Robots
Abstract: This paper proposes a heterogeneous teaming solution to the problem of target discovery and monitoring in unknown, non-convex environments. The team consists of two types of agents: agile agents with sensors capable of mapping their surroundings and slower agents that are capable of monitoring or servicing discovered targets. We propose an exploration algorithm that utilizes the IRIS algorithm to generate a graph decomposition from collision-free ellipses contained within the environment. This graph is passed to the monitoring agents, who execute polynomial-complexity assignment and touring algorithms to generate high-quality path plans that service all discovered targets. Our algorithmic structure allows the team to solve the problems of exploration, target discovery, assignment, and monitoring within unknown, non-convex environments efficiently using limited information. The performance of our proposed method is verified through batch simulations and complexity analysis.
|
|
10:20-10:25, Paper WeBT10.6 | |
Wavelet-Based Distributed Coverage for Heterogeneous Agents |
|
Rao, Ananya | Carnegie Mellon University |
Choset, Howie | Carnegie Mellon University |
Wettergreen, David | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Field Robots
Abstract: We develop a coverage approach for heterogeneous agents that leverages the different sensing and motion capabilities of a team. Coverage performance is measured using ergodicity, which when optimized balances exploitation versus exploration, where areas of interest are indicated with an information metric. Prior work uses spectral decomposition of a spatial map of information to guide a set of heterogeneous agents, each with different sensor and motion models, to optimize coverage. This work leverages wavelet transforms to decompose the information map rather than the Fourier transform typically applied to ergodic search, and demonstrates the importance of selecting a suitable wavelet family based on the information map being explored. Further, a sequence of wavelets is used for decomposition to overcome the dependency on selecting one suitable wavelet family. Our experimental results show that using wavelet families well-suited to the specific information map for information map decomposition leads to, on average, a 43% improvement over a baseline method in terms of a standard coverage metric (ergodicity), while using a well-sequenced set of wavelets for decomposition leads to a 65% improvement in coverage performance across multiple types of information maps.
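To illustrate the wavelet decomposition of an information map, the sketch below keeps only the coarsest detail bands of a 2D discrete wavelet transform (using PyWavelets); the choice of 'db4' and the truncation rule are assumptions, not the paper's wavelet-sequencing strategy.

import numpy as np
import pywt

def lowpass_info_map(info_map, wavelet="db4", keep_levels=2, total_levels=4):
    # Keep the approximation plus the `keep_levels` coarsest detail bands.
    coeffs = pywt.wavedec2(info_map, wavelet, level=total_levels)
    for i in range(1 + keep_levels, len(coeffs)):  # zero out the finest details
        coeffs[i] = tuple(np.zeros_like(c) for c in coeffs[i])
    return pywt.waverec2(coeffs, wavelet)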
|
|
10:25-10:30, Paper WeBT10.7 | |
Multi-Agent Obstacle Avoidance Using Velocity Obstacles and Control Barrier Functions |
|
Sánchez Roncero, Alejandro | KTH Royal Institute of Technology |
Cabral Muchacho, Rafael Ignacio | KTH Royal Institute of Technology |
Ogren, Petter | Royal Institute of Technology (KTH) |
Keywords: Collision Avoidance, Multi-Robot Systems, Formal Methods in Robotics and Automation
Abstract: Velocity Obstacles (VO) methods form a paradigm for collision avoidance strategies among moving obstacles and agents. While VO methods perform well in simple multi-agent environments, they do not guarantee safety and can show overly conservative behavior in common situations. In this paper, we propose to combine a VO strategy for guidance with a Control Barrier Function approach for safety, which overcomes the overly conservative behavior of VOs and formally guarantees safety. We validate our method in a baseline comparison study, using second-order integrator and car-like dynamics. Results support that our method outperforms the baselines with respect to path smoothness, collision avoidance, and success rates.
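The CBF safety layer can be pictured with a single-integrator, single-obstacle sketch: the VO-suggested velocity is minimally modified so that the barrier condition dh/dt >= -alpha*h holds, which for one affine constraint reduces to a half-space projection. The dynamics, barrier choice, and gain are simplifying assumptions, not the paper's controller.

import numpy as np

def cbf_filter(u_desired, p_robot, p_obstacle, radius, alpha=1.0):
    # h(x) = ||p - p_o||^2 - r^2 must satisfy dh/dt >= -alpha * h.
    diff = np.asarray(p_robot, dtype=float) - np.asarray(p_obstacle, dtype=float)
    h = diff @ diff - radius ** 2
    grad_h = 2.0 * diff
    u_desired = np.asarray(u_desired, dtype=float)
    slack = grad_h @ u_desired + alpha * h
    if slack >= 0.0:                 # desired velocity already satisfies the CBF
        return u_desired
    # Project onto the half-space {u : grad_h @ u >= -alpha * h}.
    return u_desired - (slack / (grad_h @ grad_h)) * grad_h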
|
|
WeBT11 |
314 |
Micro/Nano Robots |
Regular Session |
Co-Chair: Yoon, Jungwon | Gwangju Institutue of Science and Technology |
|
09:55-10:00, Paper WeBT11.1 | |
VALG: Vision-Based Adaptive Laser Gripper for Model-Free Pose Control of Floating Objects at Air-Liquid Interface |
|
Hui, Xusheng | Northwestern Polytechnical University |
Luo, Jianjun | Northwestern Polytechnical University (P.R. China) |
You, Haonan | Northwestern Polytechnical University |
Keywords: Micro/Nano Robots, Robust/Adaptive Control, Grippers and Other End-Effectors
Abstract: Non-contact manipulation at the air-liquid interface holds significant potential for applications in microrobotics, non-invasive assembly, and biochemistry analysis. However, achieving simultaneous position and orientation (pose) control of floating objects remains a considerable challenge, particularly for adaptive control without prior modeling of the objects. Here, we introduce the Vision-based Adaptive Laser Gripper (VALG) system addressing these challenges. By leveraging the distributed thermocapillary flow induced by patterned laser scanning, a pose control strategy based on the equidistant contour scanning laser is proposed and validated. The proposed system relies solely on visual recognition to generate adaptive laser grippers, which achieve static equilibrium to simultaneously constrain the position and orientation of the floating objects. Experimental validation demonstrates the effectiveness of the VALG system in independent position and orientation control, coupled pose control, and path following. The VALG system facilitates smooth, precise, fast, and adaptive pose control of generalized floating objects, establishing it as a universal and versatile platform for non-contact manipulation at the air-liquid interface.
|
|
10:00-10:05, Paper WeBT11.2 | |
In-Plane Manipulation of Soft Micro-Fiber with Ultrasonic Transducer Array and Microscope |
|
Zou, Jieyun | ShanghaiTech University |
An, Siyuan | ShanghaiTech University |
Wang, Mingyue | ShanghaiTech University |
Li, Jiaqi | ShanghaiTech University |
Shi, Yalin | The School of Control Science and Engineering (CSE) of Shandong University |
Li, You-Fu | City University of Hong Kong |
Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Nanomanufacturing
Abstract: Noncontact manipulation of soft micro-fibers has great potential in advanced manufacturing, materials science, and biomedical engineering. However, current noncontact manipulation techniques primarily focus on objects with regular shapes, e.g., solid particles, cells, or droplets, with fewer solutions available for manipulating flexible and elongated structures. In this paper, an automated ultrasonic manipulation system is introduced for in-plane soft micro-fiber manipulation, which mainly consists of an ultrasonic transducer array and a microscope. A real-time trap generation algorithm is designed to manipulate the micro-fibers using visual feedback from the microscope. A theoretical analysis is also provided to explain the deformation behavior of the micro-fiber under external forces. The system is capable of precise in-plane positioning and motion trajectory planning of the micro-fiber end, as well as in-plane morphological reshaping of the micro-fiber. Experiments validated the effectiveness of the proposed system for the in-plane manipulation of soft micro-fibers. Finally, the system was showcased through the practical application of material property characterization.
|
|
10:05-10:10, Paper WeBT11.3 | |
Interactive OT Gym: A Reinforcement Learning-Based Interactive Optical Tweezer (OT)-Driven Microrobotics Simulation Platform |
|
Zongcai, Tan | Imperial College London |
Zhang, Dandan | Imperial College London |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots
Abstract: Optical tweezers (OT) offer unparalleled capabilities for micromanipulation with submicron precision in biomedical applications. However, controlling conventional multi-trap OT to achieve cooperative manipulation of multiple complex-shaped microrobots in dynamic environments poses a significant challenge. To address this, we introduce Interactive OT Gym, a reinforcement learning (RL)-based simulation platform designed for OT-driven microrobotics. Our platform supports complex physical field simulations and integrates haptic feedback interfaces, RL modules, and context-aware shared control strategies tailored for OT-driven microrobots in cooperative biological object manipulation tasks. This integration allows for an adaptive blend of manual and autonomous control, enabling seamless transitions between human input and autonomous operation. We evaluated the effectiveness of our platform using a cell manipulation task. Experimental results show that our shared control system significantly improves micromanipulation performance, reducing task completion time by approximately 67% compared to using pure human or RL control alone and achieving a 100% success rate. With its high fidelity, interactivity, low cost, and high-speed simulation capabilities, Interactive OT Gym serves as a user-friendly training and testing environment for the development of advanced interactive OT-driven micromanipulation systems and control algorithms.
|
|
10:10-10:15, Paper WeBT11.4 | |
Model-Based Robotic Cell Aspiration: Tackling the Impact of Air Segment |
|
Zheng, Jiachun | The Chinese University of Hong Kong, Shenzhen |
Zhang, Zhuoran | The Chinese University of Hong Kong, Shenzhen |
Keywords: Automation at Micro-Nano Scales, Biological Cell Manipulation
Abstract: Cell aspiration is a common micro-manipulation technique for cell transfer, particularly in in vitro fertilization (IVF) procedures. The minuscule volume of a cell (pL) and limited damping provided by the medium make it challenging to accurately and quickly aspirate a cell to the desired position inside the micropipette. Experienced clinicians intentionally insert an air segment inside the micropipette in advance to make the aspiration easier. Nevertheless, the unclear damping effects and the varying initial length of the air segment in each aspiration pose difficulties for most operators. Inadequate judgment and response may lead to overshoot or even loss of the cell. This paper constructs a nonlinear dynamics model to elucidate the cell motion inside a micropipette containing an inserted air segment. The model reveals the impact of the air segment. A model-based controller is designed to facilitate the accurate aspiration of human sperm to a desired position, incorporating an estimated initial length of the air segment. Experiments were conducted to quantitatively evaluate the performance of both the model and the controller involving various initial air segment lengths. The results demonstrated a 100% success rate in 50 sperm aspiration experiments, achieving an average positional accuracy within ±2 pixels and an average settling time of 5.89 seconds.
|
|
10:15-10:20, Paper WeBT11.5 | |
Efficient Optimization of a Permanent Magnet Array for a Stable 2D Trap |
|
Müller, Ann-Sophia | German Cancer Research Center (DKFZ) |
Jeong, Moonkwang | Deutsches Krebsforschungszentrum (DKFZ) |
Tian, Jiyuan | German Cancer Research Center |
Zhang, Meng | German Cancer Research Center (DKFZ) |
Qiu, Tian | German Cancer Research Center (DKFZ) |
Keywords: Automation at Micro-Nano Scales, Micro/Nano Robots, Optimization and Optimal Control
Abstract: Untethered magnetic manipulation of biomedical millirobots has a high potential for minimally invasive surgical applications. However, it is still challenging to exert high actuation forces on the small robots over a large distance. Permanent magnets offer stronger magnetic torques and forces than electromagnetic coils; however, feedback control is more difficult. As proven by Earnshaw's theorem, it is not possible to achieve a stable magnetic trap in 3D by static permanent magnets. Here, we report a stable 2D magnetic force trap created by an array of permanent magnets to control a millirobot. The trap is located in an open space with a tunable distance to the magnet array in the range of 20-120 mm, which is relevant to human anatomical scales. The design is achieved by a novel GPU-accelerated optimization algorithm that uses mean squared error (MSE) and the Adam optimizer to efficiently compute the optimal angles for any number of magnets in the array. The algorithm is verified using numerical simulation and physical experiments with an array of two magnets. A millirobot is successfully trapped and controlled to follow a complex trajectory. The algorithm demonstrates high scalability by optimizing the angles for 100 magnets in under three seconds. Moreover, the optimization workflow can be adapted to optimize a permanent magnet array to achieve the desired force vector fields.
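A GPU-friendly version of the MSE-plus-Adam angle optimization can be sketched with PyTorch autograd as below; field_from_angles stands in for a differentiable magnetic field model and is an assumption, as are the learning rate and iteration count.

import torch

def optimize_angles(positions, target_field, field_from_angles, steps=500, lr=0.05):
    # positions: (N, 2) magnet locations; field_from_angles: differentiable field model.
    angles = torch.zeros(positions.shape[0], requires_grad=True)
    opt = torch.optim.Adam([angles], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((field_from_angles(angles, positions) - target_field) ** 2)
        loss.backward()
        opt.step()
    return angles.detach()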
|
|
10:20-10:25, Paper WeBT11.6 | |
Real-Time 3D MPI-Based Navigation Scheme for Microrobots with Flexible Field Free Point Trajectories and Virtual FFP Intuitive Manipulation |
|
Bui, Minh Phu | Gwangju Institute of Science and Technology |
Park, Myungjin | Gwangju Institute of Science and Technology |
Le, Tuan Anh | Gwangju Institute of Science and Technology |
Yoon, Jungwon | Gwangju Institutue of Science and Technology |
Keywords: Micro/Nano Robots, Medical Robots and Systems, Motion Control
Abstract: Magnetic Particle Imaging (MPI)-based navigation shows significant potential for accurately guiding microrobots to desired target locations. Existing MPI-based navigation systems have been limited to two-dimensional planar movements due to increased computational load and a lack of efficient 3D actuator schemes. Therefore, we introduce a real-time 3D MPI-based navigation scheme for microrobots, utilizing a flexible field-free point (FFP) trajectory scanning scheme and 3D virtual FFP (vFFP) intuitive manipulation. The FFP trajectory is chosen flexibly to enhance temporal resolution. A virtual FFP force model for the actuator function, with high potential for interactive manipulation, is used to linearize the magnetic force with respect to the relative positions of the microrobot and the actual FFP. The proposed concept has been validated using the available 3D amplitude modulation MPI system with a 90 mm bore size and a 4 T/m/µ0 gradient. By employing a flexible FFP trajectory, the MPI system can achieve an image sampling rate of up to 4 Hz for a 3D field of view of 60 × 40 × 60 mm³, enabling real-time MPI-based navigation. Furthermore, the proposed navigation control strategy can reach any target outlet within the 3D blood model with a low mean error in vFFP linearization of less than 5%.
|
|
10:25-10:30, Paper WeBT11.7 | |
3D Noncontact Micro-Particle Manipulation with Acoustic Robot End-Effector under Microscope |
|
Wang, Mingyue | ShanghaiTech University |
Li, Jiaqi | ShanghaiTech University |
Jia, Yuyu | ShanghaiTech University |
Sun, Zhenhuan | ShanghaiTech University |
Su, Hu | Institute of Automation, Chinese Academy of Science |
Liu, Song | ShanghaiTech University |
Keywords: Automation at Micro-Nano Scales, Visual Servoing, Grippers and Other End-Effectors
Abstract: As an essential component of noncontact manipulation, acoustic manipulation has achieved great success in multidisciplinary research and applications. Although acoustic tweezers have made advancements in manipulating particles in air, handling individual particles with high precision in water remains challenging and inadequately addressed due to the difficulty in precisely characterizing and calibrating acoustic robot end-effectors from a robotic perspective. In this paper, we present a vision-based automated noncontact particle manipulation approach using an acoustic robot end-effector, which achieves precise and reliable particle manipulation in 3D space. Specifically, visual feedback is incorporated for microparticle localization, and a dynamic acoustic field modulation method is proposed for controlling the end-effector. The invisible robot end-effector is localized and characterized through hydrophone scanning. The proposed vision solution is capable of automated trapping and precise translation of micro-particles suspended in a water-based environment and is applicable to particles with both negative and positive impedance contrast against the medium. Experimental results demonstrate the effectiveness of this approach towards automated noncontact particle manipulation with an acoustic robot end-effector.
|
|
WeBT12 |
315 |
Human-Robot Collaboration 2 |
Regular Session |
|
09:55-10:00, Paper WeBT12.1 | |
Dynamic Collaborative Workspace Based on Human Interference Estimation for Safe and Productive Human-Robot Collaboration |
|
Kamezaki, Mitsuhiro | The University of Tokyo |
Wada, Tomohiro | Waseda University |
Sugano, Shigeki | Waseda University |
Keywords: Human-Robot Collaboration, Human-Centered Automation, Industrial Robots
Abstract: Collaborative robots that operate safely close to workers without fences have attracted attention, but few examples of such human-robot collaboration (HRC) have been seen in factories. The main reason is the difficulty in balancing safety and productivity. Current fenceless HRC systems stop the robot when a human enters the collaborative workspace (C) where both human and robot can work, to ensure safety as regulated by ISO/TS 15066. The robot stops even when the human is far enough away, so productivity is drastically decreased (FCW, Fixed C). If a system could identify the human-work area, designate it as a no-entry space in C for the robot (C^P), and dynamically set a closed C (C^C) by shrinking C by C^P, productivity would improve because the robot could keep working in C^C, and safety would be ensured because the human could continue working in C^P. In this study, we propose a new concept of a dynamic collaborative workspace (DCW) that dynamically sets C^C and C^P based on the human's predicted trajectory. It also provides visual and auditory prompts to enable the human to understand DCW states, i.e., when a human enters C, when C is changed, and when the robot is in emergency mode. We compared four HRC systems using a real robot arm: two conventional FCW ones with and without fences and two proposed DCW ones with and without a state indicator, and found that the proposed system with a state indicator has the best productivity and ensures the same level of safety as the conventional system with fences.
|
|
10:00-10:05, Paper WeBT12.2 | |
Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration |
|
Renz, Heiko | TU Dortmund University |
Krämer, Maximilian | TU Dortmund University |
Hoffmann, Frank | Technische Universität Dortmund |
Bertram, Torsten | Technische Universität Dortmund |
Keywords: Human-Robot Collaboration, Reactive and Sensor-Based Planning, Optimization and Optimal Control
Abstract: Visual observation of objects is essential for many robotic applications, such as object reconstruction and manipulation, navigation, and scene understanding. Machine learning algorithms constitute the state-of-the-art in many fields but require vast data sets, which are costly and time-intensive to collect. Automated strategies for observation and exploration are crucial to enhance the efficiency of data gathering. Therefore, a novel strategy utilizing the Next-Best-Trajectory principle is developed for a robot manipulator operating in dynamic environments. Local trajectories are generated to maximize the information gained from observations along the path while avoiding collisions. We employ a voxel map for environment modeling and utilize raycasting from perspectives around a point of interest to estimate the information gain. A global ergodic trajectory planner provides an optional reference trajectory to the local planner, improving exploration and helping to avoid local minima. To enhance computational efficiency, raycasting for estimating the information gain in the environment is executed in parallel on the graphics processing unit. Benchmark results confirm the efficiency of the parallelization, while real-world experiments demonstrate the strategy’s effectiveness.
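The information-gain estimate can be illustrated with a simple ray-marching count of unknown voxels from a candidate viewpoint, as sketched below; the voxel labels, step size, and dictionary-based map are assumptions and omit the GPU parallelization described in the paper.

import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def information_gain(voxels, origin, directions, max_range=2.0, step=0.05):
    # voxels: dict mapping integer (i, j, k) keys to a label; origin: (3,) position.
    gain = 0
    origin = np.asarray(origin, dtype=float)
    for d in directions:
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        for r in np.arange(step, max_range, step):
            key = tuple(np.floor((origin + r * d) / step).astype(int))
            label = voxels.get(key, UNKNOWN)
            if label == UNKNOWN:
                gain += 1
            elif label == OCCUPIED:
                break                 # ray is blocked
    return gain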
|
|
10:05-10:10, Paper WeBT12.3 | |
TriHRCBot: A Robotic Architecture for Triadic Human-Robot Collaboration through Mediated Object Alignment |
|
Semeraro, Francesco | The University of Manchester |
Leadbetter, James Hugo | BAE Systems Ltd |
Cangelosi, Angelo | University of Manchester |
Keywords: Human-Robot Collaboration, Human-Aware Motion Planning, Cognitive Control Architectures
Abstract: Human-robot collaboration has great potential in enhancing robot deployment at close proximity with people, especially in non-dyadic collaborations with multiple users. However, autonomous systems that are capable of handling such interactions in a physical domain are rare. This work proposes TriHRCBot, a robotic architecture designed to handle a collaborative task that involves two concurrent users. The architecture is sensitive to position, orientation, body lengths and state of the users in the interaction, and uses this information to adjust the pose of a target object to enable both users to act on it at the same time. A robotic system equipped with the TriHRCBot architecture was deployed in a user study in which 30 participants from the BAE Systems Academy for Skills and Knowledge Centre interacted with it during such multi-user collaborative task. The study shows that the participants considered TriHRCBot acceptable for the task at hand.
|
|
10:10-10:15, Paper WeBT12.4 | |
Open-Nav: Exploring Zero-Shot Vision-And-Language Navigation in Continuous Environment with Open-Source LLMs |
|
Qiao, Yanyuan | The University of Adelaide |
Lyu, Wenqi | The University of Adelaide |
Wang, Hui | The University of Adelaide, AIML |
Wang, Zixu | South China University of Technology |
Li, Zerui | Adelaide University |
Zhang, Yuan | The University of Adelaide |
Tan, Mingkui | South China University of Technology |
Wu, Qi | University of Adelaide |
Keywords: Human-Robot Collaboration, AI-Enabled Robotics, AI-Based Methods
Abstract: Vision-and-Language Navigation (VLN) tasks require an agent to follow textual instructions to navigate through 3D environments. Traditional approaches use supervised learning methods, relying heavily on domain-specific datasets to train VLN models. Recent methods try to utilize closed-source large language models (LLMs) like GPT-4 to solve VLN tasks in zero-shot manners, but face challenges related to expensive token costs and potential data breaches in real-world applications. In this work, we introduce Open-Nav, a novel study that explores open-source LLMs for zero-shot VLN in the continuous environment. Open-Nav employs a spatial-temporal chain-of-thought (CoT) reasoning approach to break down tasks into instruction comprehension, progress estimation, and decision-making. It enhances scene perceptions with fine-grained object and spatial knowledge to improve LLM's reasoning in navigation. Our extensive experiments in both simulated and real-world environments demonstrate that Open-Nav achieves competitive performance compared to using closed-source LLMs.
|
|
10:15-10:20, Paper WeBT12.5 | |
Integrating Field of View in Human-Aware Collaborative Planning |
|
Hsu, Ya-Chuan | University of Southern California |
Michael, Defranco | University of Southern California |
Patel, Rutvik Rakeshbhai | University of Southern California |
Nikolaidis, Stefanos | University of Southern California |
Keywords: Human-Robot Collaboration, Planning under Uncertainty, Human-Aware Motion Planning
Abstract: In human-robot collaboration (HRC), it is crucial for robot agents to consider humans' knowledge of their surroundings. In reality, humans possess a narrow field of view (FOV), limiting their perception. However, research on HRC often overlooks this aspect and presumes an omniscient human collaborator. Our study addresses the challenge of adapting to the evolving subtask intent of humans while accounting for their limited FOV. We integrate FOV within the human-aware probabilistic planning framework. To account for the large state spaces that result from considering FOV, we propose a hierarchical online planner that efficiently finds approximate solutions while enabling the robot to explore low-level action trajectories that enter the human FOV, influencing their intended subtask. Through a user study with our adapted cooking domain, we demonstrate that our FOV-aware planner reduces human interruptions and redundant actions during collaboration by adapting to human perception limitations. We extend these findings to a virtual reality kitchen environment, where we observe similar collaborative behaviors.
|
|
10:20-10:25, Paper WeBT12.6 | |
PACE: Proactive Assistance in Human-Robot Collaboration through Action-Completion Estimation |
|
De Lazzari, Davide | University of Padua |
Terreran, Matteo | University of Padova |
Giacomuzzo, Giulio | University of Padova |
Jain, Siddarth | Mitsubishi Electric Research Laboratories (MERL) |
Falco, Pietro | University of Padova |
Carli, Ruggero | University of Padova |
Romeres, Diego | Mitsubishi Electric Research Laboratories |
Keywords: Human-Robot Collaboration, Assembly
Abstract: This paper introduces the Proactive Assistance through action-Completion Estimation (PACE) framework, designed to enhance human-robot collaboration through real-time monitoring of human progress. PACE incorporates a novel method that combines Dynamic Time Warping (DTW) with correlation analysis to track human task progression from hand movements. From limited demonstrations, PACE trains a reinforcement learning policy that provides proactive assistance, synchronizing robotic actions with human activities to minimize idle time and enhance collaboration efficiency. We validate the framework through user studies involving 12 participants, showing significant improvements in interaction fluency, reduced waiting times, and positive user feedback compared to traditional methods.
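The progress-estimation idea described above can be illustrated with a small sketch: align the partial hand trajectory observed so far against a reference demonstration using dynamic time warping (DTW) and read off the fraction of the reference that has been covered. All names and the Euclidean cost are assumptions; this is not the authors' implementation.

```python
# Minimal sketch: estimating task progress by aligning a partial hand trajectory
# to a reference demonstration with dynamic time warping.
import numpy as np

def dtw_align(partial, reference):
    """Return the DTW cost matrix between a partial trajectory and a full reference."""
    n, m = len(partial), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(partial[i - 1] - reference[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D

def estimate_progress(partial, reference):
    """Fraction of the reference demonstration best explained by the observation so far."""
    D = dtw_align(partial, reference)
    j_star = int(np.argmin(D[len(partial), 1:])) + 1   # best matching reference index
    return j_star / len(reference)

# Example: 3-D hand positions; the observation covers roughly the first half of the task.
reference = np.cumsum(np.random.randn(100, 3) * 0.01, axis=0)
partial = reference[:50] + np.random.randn(50, 3) * 0.002
print(f"estimated completion: {estimate_progress(partial, reference):.2f}")
```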
|
|
10:25-10:30, Paper WeBT12.7 | |
Improving Human-Robot Collaboration Via Computational Design |
|
Zhi, Jixuan | George Mason University |
Lien, Jyh-Ming | George Mason University |
Keywords: Service Robotics, Human-Aware Motion Planning, Simulation and Animation
Abstract: As robots enter our day-to-day lives, the shared space surrounding humans and robots becomes critical for facilitating human-robot collaboration. The design of this shared space should satisfy humans' preferences and support robots' efficiency. This work uses the kitchen as an example to illustrate the importance of good space design in enhancing collaboration. Given the kitchen boundary, food stations, counters, and recipes, the proposed method determines the optimal placement of stations and counters that meets the requirements of kitchen design rules and improves performance. The key technical challenge is that the optimization method usually evaluates thousands of designs, and each evaluation analyzes the traffic flow of the space, which requires solving many motion planning problems. To address this challenge, we use a decentralized motion planner that can solve multi-agent motion planning efficiently. Our results indicate that optimized kitchen designs can provide noticeable performance improvements to human-robot collaboration.
|
|
WeBT13 |
316 |
Multifingered Hands |
Regular Session |
Chair: Schimmels, Joseph | Marquette University |
|
09:55-10:00, Paper WeBT13.1 | |
A Vision-Based Force/Position Fusion Actuation-Sensing Scheme for Tendon-Driven Mechanism |
|
Chen, Shiwei | Harbin Institute of Technology |
Deng, Zhiming | Harbin Institute of Technology |
Gu, Haiyu | Harbin Institute of Technology |
Wei, Cheng | Harbin Institute of Technology |
Keywords: Multifingered Hands, Tendon/Wire Mechanism, Computer Vision for Automation
Abstract: Current robotic sensing systems typically employ multiple sensors to obtain position and force information. This usually leads to challenges such as high cost and complex wiring. In this paper, a vision-based force/position fusion actuation-sensing scheme is proposed. The scheme can measure the angles and torques of all joints with only one low-cost camera. Through careful design of the actuation-sensing mechanism, the camera achieves high-resolution and high-bandwidth sensing. The proposed angle measurement model and external torque measurement model are evaluated through rigorous experiments. The experimental results indicate that the designed mechanism shows excellent repeatability and accuracy. The average error for all angles is less than 1 degree, and the average maximum relative error for torque is 4.43%.
|
|
10:00-10:05, Paper WeBT13.2 | |
BODex: Scalable and Efficient Robotic Dexterous Grasp Synthesis Using Bilevel Optimization |
|
Chen, Jiayi | Peking University |
Ke, Yubin | Peking University |
Wang, He | Peking University |
Keywords: Grasping, Multifingered Hands, Big Data in Robotics and Automation
Abstract: Robotic dexterous grasping is important for interacting with the environment. To unleash the potential of data-driven models for dexterous grasping, a large-scale, high-quality dataset is essential. While gradient-based optimization offers a promising way for constructing such datasets, previous works suffer from limitations, such as inefficiency, strong assumptions in the grasp quality energy, or limited object sets for experiments. Moreover, the lack of a standard benchmark for comparing different methods and datasets hinders progress in this field. To address these challenges, we develop a highly efficient synthesis system and a comprehensive benchmark with MuJoCo for dexterous grasping. We formulate grasp synthesis as a bilevel optimization problem, combining a novel lower-level quadratic programming (QP) with an upper-level gradient descent process. By leveraging recent advances in CUDA-accelerated robotic libraries and GPU-based QP solvers, our system can parallelize thousands of grasps and synthesize over 49 grasps per second on a single 3090 GPU. Our synthesized grasps for the Shadow, Allegro, and Leap hands all achieve a success rate above 75% in simulation, with a penetration depth under 1 mm, outperforming existing baselines on nearly all metrics. Compared to the previous large-scale dataset, DexGraspNet, our dataset significantly improves the performance of learning models, raising the simulated success rate from around 40% to 80%. Real-world testing of the trained model on the Shadow Hand achieves an 81% success rate across 20 diverse objects. The codes and datasets are released on our project page: https://pku-epic.github.io/BODex.
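The bilevel structure described above, a lower-level quadratic program nested inside an upper-level gradient descent, can be sketched on a toy problem. The code below is only a structural illustration under assumed toy data; BODex solves a constrained QP over contact forces in parallel on the GPU.

```python
# Structural sketch of bilevel optimization (toy problem, not the BODex formulation):
# the lower level solves a small regularized quadratic problem in closed form, and the
# upper level runs gradient descent (finite differences here) on its optimal value.
import numpy as np

rng = np.random.default_rng(0)
A0, A1 = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
b = rng.standard_normal(6)
lam = 0.1

def lower_level(theta):
    """Closed-form minimizer of 0.5*||A(theta) f - b||^2 + 0.5*lam*||f||^2 over f."""
    A = A0 + theta * A1
    f = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)
    return A, f

def upper_objective(theta):
    """Value of the lower-level problem at its minimizer, as a function of theta."""
    A, f = lower_level(theta)
    return 0.5 * np.sum((A @ f - b) ** 2) + 0.5 * lam * np.sum(f ** 2)

theta, eps, lr = 0.0, 1e-5, 0.05
for _ in range(200):   # upper level: plain gradient descent on the bilevel objective
    grad = (upper_objective(theta + eps) - upper_objective(theta - eps)) / (2 * eps)
    theta -= lr * grad
print(f"theta = {theta:.3f}, objective = {upper_objective(theta):.4f}")
```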
|
|
10:05-10:10, Paper WeBT13.3 | |
DemoStart: Demonstration-Led Auto-Curriculum Applied to Sim-To-Real with Multi-Fingered Robots |
|
Bauza Villalonga, Maria | Massachusetts Institute of Technology |
Chen, Jose Enrique | DeepMind |
Dalibard, Valentin | Google DeepMind |
Gileadi, Nimrod | Google |
Hafner, Roland | Google DeepMind |
Martins, Murilo | DeepMind |
Moore, Joss | Google DeepMind |
Pevceviciute, Rugile | Deepmind |
Laurens, Antoine, Marin, Alix | EPFL |
Rao, Dushyant | Google DeepMind |
Zambelli, Martina | Google DeepMind |
Riedmiller, Martin | DeepMind |
Scholz, Jonathan | Google Deepmind |
Bousmalis, Konstantinos | DeepMind |
Nori, Francesco | Google DeepMind |
Heess, Nicolas | Google Deepmind |
Keywords: Multifingered Hands, Reinforcement Learning, Dexterous Manipulation
Abstract: We present DemoStart, a novel auto-curriculum reinforcement learning method capable of learning complex manipulation behaviors on an arm equipped with a three-fingered robotic hand, from only a sparse reward and a handful of demonstrations in simulation. Learning from simulation drastically reduces the development cycle of behavior generation, and domain randomization techniques are leveraged to achieve successful zero-shot sim-to-real transfer. Transferred policies are learned directly from raw pixels from multiple cameras and robot proprioception. Our approach outperforms policies learned from demonstrations on the real robot and requires 100 times fewer demonstrations, collected in simulation. More details and videos are available at https://sites.google.com/view/demostart.
|
|
10:10-10:15, Paper WeBT13.4 | |
Dexterous Assembly Using a Planar Hand Having Programmable Passive Compliance |
|
Frye, Jacob | Marquette University |
Schimmels, Joseph | Marquette University |
Keywords: Compliance and Impedance Control, Multifingered Hands, Dexterous Manipulation
Abstract: Special purpose compliant end-effectors are effective in realizing task-appropriate passive compliance. This paper presents a programmable, 3-fingered, antagonistic, compliant hand (P3ACH) capable of realizing a desired compliant behavior within a large space of multidirectional compliant behaviors. Manipulation dexterity is demonstrated by performing different assembly tasks faster, more robustly, and with lower contact forces than an active system realizing the same compliant behavior.
|
|
10:15-10:20, Paper WeBT13.5 | |
GAGrasp: Geometric Algebra Diffusion for Dexterous Grasping |
|
Zhong, Tao | Princeton University |
Allen-Blanchette, Christine | Princeton University |
Keywords: Multifingered Hands, Deep Learning in Grasping and Manipulation, Dexterous Manipulation
Abstract: We propose GAGrasp, a novel framework for dexterous grasp generation that leverages geometric algebra representations to enforce equivariance to SE(3) transformations. By encoding the SE(3) symmetry constraint directly into the architecture, our method improves data and parameter efficiency while enabling robust grasp generation across diverse object poses. Additionally, we incorporate a differentiable physics-informed refinement layer, which ensures that generated grasps are physically plausible and stable. Extensive experiments demonstrate the model's superior performance in generalization, stability, and adaptability compared to existing methods.
|
|
10:20-10:25, Paper WeBT13.6 | |
Model Q-II: An Underactuated Hand with Enhanced Grasping Modes and Primitives for Dexterous Manipulation |
|
Dong, Yinkai | Harvard University |
Kim, Jehyeok | Yale University |
Patel, Vatsal | Yale University |
Feng, Huijuan | Southern University of Science and Technology |
Dollar, Aaron | Yale University |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Multifingered Hands
Abstract: This paper introduces Model Q-II, an enhanced underactuated robotic hand designed to improve dexterous manipulation through expanded grasping modes and manipulation primitives. The Model Q-II incorporates tripod and enhanced power grasping modes, achieving increased versatility without additional actuators. The design employs passive mechanisms, such as lateral contact walls and a finger-locking system, to facilitate seamless transitions between modes, enabling precise pinch-to-tripod and pinch-to-power gating. These enhancements allow the hand to perform complex in-hand manipulations, including multi-directional object positioning. Theoretical analysis, simulations, and experimental evaluations validate the hand's performance, demonstrating improved grasping force, range, and manipulation capabilities. The results highlight Model Q-II's ability to handle various tasks, offering a robust, cost-effective solution for applications requiring both precise and powerful grasping.
|
|
10:25-10:30, Paper WeBT13.7 | |
Canonical Representation and Force-Based Pretraining of 3D Tactile for Dexterous Visuo-Tactile Policy Learning |
|
Wu, Tianhao | Peking University |
Li, Jinzhou | Cornell University |
Zhang, Jiyao | Peking University |
Mingdong Wu, Aaron | Peking University |
Dong, Hao | Peking University |
Keywords: Dexterous Manipulation, Multifingered Hands, Force and Tactile Sensing
Abstract: Tactile sensing plays a vital role in enabling robots to perform fine-grained, contact-rich tasks. However, the high dimensionality of tactile data, due to the large coverage on dexterous hands, poses significant challenges for effective tactile feature learning, especially for 3D tactile data, as there are no large standardized datasets and no strong pretrained backbones. To address these challenges, we propose a novel canonical representation that reduces the difficulty of 3D tactile feature learning and further introduces a force-based self-supervised pretraining task to capture both local and net force features, which are crucial for dexterous manipulation. Our method achieves an average success rate of 78% across four fine-grained, contact-rich dexterous manipulation tasks in real-world experiments, demonstrating effectiveness and robustness compared to other methods. Further analysis shows that our method fully utilizes both spatial and force information from 3D tactile data to accomplish the tasks. The videos can be viewed at https://3dtacdex.github.io.
|
|
WeBT14 |
402 |
Tracking and Prediction 3 |
Regular Session |
Co-Chair: Vitzilaios, Nikolaos | University of South Carolina |
|
09:55-10:00, Paper WeBT14.1 | |
Dynamic Compact Consensus Tracking for Aerial Robots |
|
Sun, XiaoLou | Southeast University |
Quan, Zhibin | Southeast University |
Zhang, Feng | Nanjing University of Posts and Telecommunications |
Li, Yuntian | PML |
Wang, Chunyan | Purple Mountain Laboratories |
Si, Wufei | Purple Mountain Laboratories |
Ni, Wenhui | Purple Mountain Laboratory |
Guan, Runwei | University of Liverpool |
Wu, Yuan | Purple Mountain Lab |
Meng, Shen | Purple Mountain Laboratories |
Huang, YongMing | PML |
Keywords: Visual Tracking, Deep Learning Methods, Visual Learning
Abstract: Existing one-stream trackers have attracted widespread attention. However, they are not applicable to real-time UAV tracking systems due to substantial computational overhead, especially when dynamic templates are introduced. To address this issue, we propose a novel Dynamic Compact Consensus Tracker (DC2T), constructed by stacking modules, each consisting of a Compact Token Encoder (CTE) and Dynamic Consensus Attention (DCA). Unlike traditional methods that convert images into a large number of tokens, the CTE, inspired by "superpixels", extracts a compact set of representative tokens from both initial and dynamic templates, eliminating the need for a large token set. This strategic reduction in the number of compact tokens markedly decreases the computational load of the CTE, enhancing the efficiency of subsequent attention operations. To achieve near-linear complexity in the DCA, compact dynamic template tokens (as keys) are re-queried by search tokens (as queries) to perform dynamic consensus on the aggregated tokens (as values). This arrangement seamlessly incorporates dynamic spatio-temporal features into the DCA while avoiding the computational burden typically associated with dynamic templates. To further enhance the system's responsiveness and accuracy, a direct control network is crafted to seamlessly incorporate the prediction of high-level control values into the tracking network, ensuring a cohesive and efficient interaction with the controller. Comprehensive experiments and real-world evaluations demonstrate DC2T's superior performance, accompanied by a significant reduction in FLOPs. Furthermore, we have conducted experiments demonstrating the tracker's ability to integrate seamlessly with other technologies such as SLAM and detection, enabling precise tracking of arbitrary objects. The tracker code will be released at https://github.com/xiaolousun/refine-pytracking.git.
|
|
10:00-10:05, Paper WeBT14.2 | |
CGTrack: Cascade Gating Network with Hierarchical Feature Aggregation for UAV Tracking |
|
Li, Weihong | University of Chinese Academy of Sciences |
Liu, Xiaoqiong | University of North Texas |
Fan, Heng | University of North Texas |
Zhang, Libo | Iscas |
Keywords: Visual Tracking, Computer Vision for Automation, Visual Learning
Abstract: Recent advancements in visual object tracking have markedly improved the capabilities of unmanned aerial vehicle (UAV) tracking, which is a critical component in real-world robotics applications. While the integration of hierarchical lightweight networks has become a prevalent strategy for enhancing efficiency in UAV tracking, it often results in a significant drop in network capacity, which further exacerbates challenges in UAV scenarios, such as frequent occlusions and extreme changes in viewing angles. To address these issues, in this paper we introduce a novel family of UAV trackers, termed CGTrack, which combines both explicit and implicit techniques to expand network capacity within a coarse-to-fine framework. Specifically, we first introduce a Hierarchical Feature Cascade (HFC) module that leverages the spirit of feature reuse to increase network capacity by integrating deep semantic cues with rich spatial information, incurring minimal computational costs while enhancing feature representation. Based on this, we design a novel Lightweight Gated Center Head (LGCH) that utilizes gating mechanisms to decouple target-oriented coordinates from previously expanded features, which contain dense local discriminative information. Extensive experiments on three challenging UAV tracking benchmarks demonstrate that CGTrack achieves state-of-the-art performance while running fast. Code will be available at https://github.com/NightwatchFox11/CGTrack.
|
|
10:05-10:10, Paper WeBT14.3 | |
Tracking Everything in Robotic-Assisted Surgery |
|
Zhan, Bohan | Imperial College London |
Zhao, Wang | Tsinghua University |
Fang, Yi | New York University |
Du, Bo | Wuhan University |
Vasconcelos, Francisco | University College London |
Stoyanov, Danail | University College London |
Elson, Daniel | Imperial College London |
Huang, Baoru | Imperial College London |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Accurate tracking of tissues and instruments in videos is crucial for Robotic-Assisted Minimally Invasive Surgery (RAMIS), as it enables the robot to comprehend the surgical scene with precise locations and interactions of tissues and tools. Traditional keypoint-based sparse tracking is limited by feature points, while flow-based dense two-view matching suffers from long-term drift. Recently, the Tracking Any Point (TAP) algorithm was proposed to overcome these limitations and achieve dense, accurate long-term tracking. However, its efficacy in surgical scenarios remains untested, largely due to the lack of a comprehensive surgical tracking dataset for evaluation. To address this gap, we introduce a new annotated surgical tracking dataset for benchmarking tracking methods in surgical scenarios, comprising real-world surgical videos with complex tissue and instrument motions. We extensively evaluate state-of-the-art (SOTA) TAP-based algorithms on this dataset and reveal their limitations in challenging surgical scenarios, including fast instrument motion, severe occlusions, and motion blur. Furthermore, we propose a new tracking method, namely SurgMotion, to address these challenges and further improve tracking performance. Our proposed method outperforms most TAP-based algorithms in surgical instrument tracking, and in particular demonstrates significant improvements over baselines on challenging medical videos.
|
|
10:10-10:15, Paper WeBT14.4 | |
LaMOT: Language-Guided Multi-Object Tracking |
|
Li, Yunhao | University of Chinese Academy of Sciences |
Liu, Xiaoqiong | University of North Texas |
Liu, Luke | Centennial High School |
Fan, Heng | University of North Texas |
Zhang, Libo | Iscas |
Keywords: Visual Tracking, Computer Vision for Automation, Visual Learning
Abstract: Vision-Language MOT is a critical tracking problem that has recently garnered increasing attention. It aims to track objects based on human language commands, displacing the traditional use of templates or pre-set information from training sets in conventional tracking tasks. However, a key challenge remains in understanding why language is used for tracking, hindering further development. In this paper, we introduce Language-Guided MOT, a unified task framework, and LaMOT, a corresponding large-scale benchmark, which encompasses diverse scenarios and language descriptions and comprises 1,660 sequences from 4 different datasets. The purpose of LaMOT is to unify various Vision-Language MOT tasks while providing a standardized evaluation platform. To ensure high-quality annotations, we manually assign appropriate descriptive texts to each target in every video and conduct careful inspection and correction. To our knowledge, LaMOT is the first benchmark dedicated to Language-Guided MOT. Additionally, we propose a simple yet effective tracker, termed LaMOTer. By establishing a unified task framework, providing challenging benchmarks, and offering insights for future algorithm design and evaluation, we expect to contribute to the advancement of research in Vision-Language MOT. We will release the data at https://github.com/Nathan-Li123/LaMOT.
|
|
10:15-10:20, Paper WeBT14.5 | |
Real-Time UAV Tracking: A Comparative Study of YOLOv8 with Object Tracking Algorithms |
|
Russo, Tyler | University of South Carolina |
Vitzilaios, Nikolaos | University of South Carolina |
Keywords: Visual Tracking
Abstract: Unmanned Aerial Vehicle (UAV) usage has rapidly increased, leading to an effort to accurately and efficiently track UAVs. Many existing approaches utilize YOLO, a state-of-the-art object detection model, in conjunction with object tracking algorithms to detect and follow UAVs in real-time. However, these systems typically focus on a single method, without considering alternative tracking methods. In this paper, we present an experimental comparison of multiple object tracking algorithms integrated with YOLOv8, offering a comprehensive evaluation of their performance in UAV tracking scenarios. First, the model size was optimized to determine the best balance between speed and accuracy. Then, various tracking methods are tested to determine the most effective combination. The YOLOv8 model combined with a Kernelized Correlation Filter outperformed various other trackers in varying environmental scenarios, with a combined success rate and tracking accuracy of 0.8041. This approach was further implemented in real-time on a Jetson Orin Nano GPU, utilizing a pan-tilt gimbal and an Intel RealSense D435i camera. Running at 20 FPS, the system demonstrated robustness and stability during motion and various environmental scenarios, highlighting its potential for integration into applications such as ground-based UAV surveillance.
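The detect-then-track pattern evaluated in this paper can be sketched as follows, assuming the ultralytics YOLOv8 API and the KCF tracker from opencv-contrib-python; the model file, video name, and 30-frame re-detection interval are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch of the detect-then-track pattern: YOLOv8 (re-)detects the UAV
# periodically, and a fast KCF tracker follows it between detections.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                 # small model for real-time use (assumed weights file)
cap = cv2.VideoCapture("uav_clip.mp4")     # assumed input clip
tracker, frame_idx = None, 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if tracker is None or frame_idx % 30 == 0:      # (re-)detect every 30 frames
        result = model(frame, verbose=False)[0]
        if len(result.boxes) > 0:
            x1, y1, x2, y2 = result.boxes.xyxy[0].tolist()
            tracker = cv2.TrackerKCF_create()
            tracker.init(frame, (int(x1), int(y1), int(x2 - x1), int(y2 - y1)))
    elif tracker is not None:
        ok, bbox = tracker.update(frame)             # fast KCF update between detections
        if not ok:
            tracker = None                           # lost target: fall back to detection
    frame_idx += 1
cap.release()
```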
|
|
10:20-10:25, Paper WeBT14.6 | |
MoD-SLAM: Monocular Dense Mapping for Unbounded 3D Scene Reconstruction |
|
Zhou, Heng | Columbia University |
Guo, Zhetao | Cloudspace Technology Co., Ltd |
Yuxiang, Ren | Beijing Dianjing Ciyuan Culture Communication Co., Ltd. |
Liu, Shuhong | The University of Tokyo |
Zhang, Lechen | Columbia University |
Zhang, Kaidi | Columbia University |
Li, Mingrui | Dalian University of Technology |
Keywords: SLAM, Mapping, Localization
Abstract: Monocular SLAM has received a lot of attention due to its simple RGB inputs and freedom from complex sensor constraints. However, existing monocular SLAM systems lack accurate depth estimation, which limits tracking and mapping accuracy. To address this limitation, we propose MoD-SLAM, the first monocular NeRF-based dense mapping method that allows real-time 3D reconstruction in unbounded scenes. Specifically, we introduce a depth estimation module in the front-end to extract accurate prior depth values to supervise the mapping and tracking processes. This strategy is essential to improving SLAM performance. Moreover, a Gaussian-based unbounded scene representation is designed to address the challenge of mapping scenes without boundaries. By introducing a robust depth loss term into the tracking process, our SLAM system achieves more precise pose estimation in large-scale scenes. Our experiments on two standard datasets show that MoD-SLAM achieves competitive performance, improving the accuracy of 3D reconstruction and localization by up to 30% and 15%, respectively, compared with existing monocular SLAM systems.
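A robust depth term of the kind mentioned above can be sketched as a Huber-style penalty between rendered depth and the monocular depth prior, so that outlier pixels in the prior are down-weighted. The snippet below is an illustration under assumed names and values, not the MoD-SLAM loss.

```python
# Illustrative sketch: a robust (Huber-style) depth residual between rendered
# depth and a monocular depth prior, of the kind a tracking objective could add
# to down-weight outliers in the prior.
import numpy as np

def robust_depth_loss(rendered, prior, valid_mask, delta=0.1):
    """Huber loss over valid pixels; large residuals grow linearly, not quadratically."""
    r = (rendered - prior)[valid_mask]
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return float(np.mean(np.where(np.abs(r) <= delta, quad, lin)))

rendered = np.random.rand(120, 160) * 3.0
prior = rendered + np.random.randn(120, 160) * 0.02
prior[10:20, 10:20] += 2.0                       # a patch of bad prior depth
mask = prior > 0
print(f"robust depth loss: {robust_depth_loss(rendered, prior, mask):.4f}")
```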
|
|
10:25-10:30, Paper WeBT14.7 | |
A Certifiable Algorithm for Simultaneous Shape Estimation and Object Tracking |
|
Shaikewitz, Lorenzo | Massachusetts Institute of Technology |
Ubellacker, Samuel | Massachusetts Institute of Technology |
Carlone, Luca | Massachusetts Institute of Technology |
Keywords: RGB-D Perception, Visual Tracking, Optimization and Optimal Control
Abstract: Applications from manipulation to autonomous vehicles rely on robust and general object tracking to safely perform tasks in dynamic environments. We propose the first certifiably optimal category-level approach for simultaneous shape estimation and pose tracking of an object of known category (e.g. a car). Our approach uses 3D semantic keypoint measurements extracted from an RGB-D image sequence, and phrases the estimation as a fixed-lag smoothing problem. Temporal constraints enforce the object's rigidity (fixed shape) and smooth motion according to a constant-twist motion model. The solutions to this problem are the estimates of the object's state (poses, velocities) and shape (parameterized according to the active shape model) over the smoothing horizon. Our key contribution is to show that despite the non-convexity of the fixed-lag smoothing problem, we can solve it to certifiable optimality using a small-size semidefinite relaxation. We also present a fast outlier rejection scheme that filters out incorrect keypoint detections with shape and time compatibility tests, and wrap our certifiable solver in a graduated non-convexity scheme. We evaluate the proposed approach on synthetic and real data, showcasing its performance in a table-top manipulation scenario and a drone-based vehicle tracking application.
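The constant-twist motion model referenced above is standard SE(3) kinematics: each pose in the smoothing horizon is the previous pose composed with the exponential of a fixed body twist scaled by the time step. A minimal sketch (not the authors' solver) follows.

```python
# Small sketch of a constant-twist motion model: the pose at each step is propagated
# by the exponential of a fixed body twist times the time step.
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """se(3) hat operator: 6-vector twist [v, w] -> 4x4 matrix."""
    v, w = xi[:3], xi[3:]
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    X = np.zeros((4, 4))
    X[:3, :3], X[:3, 3] = W, v
    return X

def propagate(T0, twist, dt, steps):
    """Poses under a constant body twist over `steps` intervals of length `dt`."""
    step = expm(hat(twist) * dt)
    poses, T = [T0], T0.copy()
    for _ in range(steps):
        T = T @ step
        poses.append(T.copy())
    return poses

T0 = np.eye(4)
twist = np.array([0.5, 0.0, 0.0, 0.0, 0.0, 0.2])   # 0.5 m/s forward, 0.2 rad/s yaw
trajectory = propagate(T0, twist, dt=0.1, steps=10)
print(trajectory[-1][:3, 3])                        # final position
```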
|
|
WeBT15 |
403 |
Surgical Robotics: Laparoscopy |
Regular Session |
|
09:55-10:00, Paper WeBT15.1 | |
Hypergraph-Transformer (HGT) for Interaction Event Prediction in Laparoscopic and Robotic Surgery |
|
Yin, Lianhao | MIT |
Ban, Yutong | Shanghai Jiao Tong University |
Eckhoff, Jennifer A | MGH |
Meireles, Ozanan | MGH |
Rus, Daniela | MIT |
Rosman, Guy | Massachusetts Institute of Technology |
Keywords: Medical Robots and Systems, Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Understanding and anticipating events and actions is critical for intraoperative assistance and decision-making during minimally invasive surgery. We propose a predictive neural network that is capable of understanding and predicting critical interaction aspects of surgical workflow from endoscopic, intracorporeal video data, while flexibly leveraging surgical knowledge graphs. The approach incorporates a hypergraph-transformer (HGT) structure that encodes expert knowledge into the network design and predicts the hidden embedding of the graph. We verify our approach on established surgical datasets and applications, including the prediction of action triplets and the achievement of the Critical View of Safety (CVS), a critical safety measure. Moreover, we address specific, safety-related forecasts of surgical processes, such as predicting the clipping of the cystic duct or artery without prior achievement of the CVS. Our results demonstrate improved prediction of interaction events with our approach compared to unstructured alternatives.
|
|
10:00-10:05, Paper WeBT15.2 | |
Robotic Flexible Magnetic Retractor for Dynamic Tissue Manipulation in Endoscopic Submucosal Dissection |
|
Chan, Wai Shing | The Chinese University of Hong Kong |
Sun, Yichong | The Chinese University of Hong Kong |
Li, Yehui | The Chinese University of Hong Kong |
Li, Jixiu | The Chinese University of Hong Kong |
Yip, Hon Chi | The Chinese University of Hong Kong |
Chiu, Philip, Wai-yan | Chinese University of Hong Kong |
Li, Zheng | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Surgical Robotics: Steerable Catheters/Needles
Abstract: Endoscopic submucosal dissection (ESD) is a procedure targeted at early gastrointestinal cancer. Traction plays a crucial role in enhancing the efficiency of cutting lesions, thereby reducing procedural complexity and duration. Among traction devices, current non-magnetic ones complicate the workspace during directional tissue manipulation, while current magnetic traction devices cannot be prepared before the procedure and require withdrawing the endoscope midway to re-introduce the magnetic retractor to the lesion site. To address these shortcomings, this paper introduces a robotic flexible magnetic retractor designed for tissue manipulation during ESD. The flexible prototype can be seamlessly inserted through the instrument channel of an endoscope to the lesion site without the need for endoscope withdrawal. Moreover, the introduction of robotic magnetic actuation enhances the agile control of magnetic retractors while alleviating the surgeon's workload in magnetic-retractor-assisted ESD. The experimental results validate the functionality and efficacy of the prototype magnetic retractor in magnetic traction-assisted ESD procedures. The retractor demonstrated its ability to provide adequate traction and accomplish clinical tasks. This innovative approach holds promise for enhancing the efficiency and outcomes of ESD procedures, offering a compelling alternative to traditional traction methods.
|
|
10:05-10:10, Paper WeBT15.3 | |
Leveraging Surgical Activity Grammar for Primary Intention Prediction in Laparoscopy Procedures |
|
Zhang, Jie | Huazhong University of Science and Technology |
Zhou, Song | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Wan, Chidan | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Cai, Xiong | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Surgical Robotics: Laparoscopy, Surgical Robotics: Planning, Recognition
Abstract: Surgical procedures are inherently complex and dynamic, with intricate dependencies and various execution paths. Accurate identification of the intentions behind critical actions, referred to as Primary Intentions (PIs), is crucial to understanding and planning the procedure. This paper presents a novel framework that advances PI recognition in instructional videos by combining top-down grammatical structure with bottom-up visual cues. The grammatical structure is based on a rich corpus of surgical procedures, offering a hierarchical perspective on surgical activities. A grammar parser, utilizing the surgical activity grammar, processes visual data obtained from laparoscopic images through surgical action detectors, ensuring a more precise interpretation of the visual information. Experimental results on the benchmark dataset demonstrate that our method outperforms existing surgical activity detectors that rely solely on visual features. Our research provides a promising foundation for developing advanced robotic surgical systems with enhanced planning and automation capabilities.
|
|
10:10-10:15, Paper WeBT15.4 | |
SLAM Assisted 3D Tracking System for Laparoscopic Surgery |
|
Song, Jingwei | University of Michigan |
Zhang, Ray | University of Michigan |
Zhang, Wenwei | Wuhan United Imaging Surgical Co., Ltd |
Zhou, Hao | Shanghai United Imaging Healthcare Advanced Technology Research |
Ghaffari, Maani | University of Michigan |
Keywords: Surgical Robotics: Laparoscopy, Visual Tracking, SLAM
Abstract: A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the ORB-SLAM2 monocular mode. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking, and the 3D shape is incorporated as a geometric prior in its pose graph optimization. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.
|
|
10:15-10:20, Paper WeBT15.5 | |
SurgPose: Generalisable Surgical Instrument Pose Estimation Using Zero-Shot Learning and Stereo Vision |
|
Rai, Utsav | Imperial College London |
Xu, Haozheng | Imperial College London |
Giannarou, Stamatia | Imperial College London |
Keywords: Surgical Robotics: Laparoscopy, Localization, Visual Tracking
Abstract: Accurate pose estimation of surgical tools in Robot-assisted Minimally Invasive Surgery (RMIS) is essential for surgical navigation and robot control. While traditional marker-based methods offer accuracy, they face challenges with occlusions, reflections, and tool-specific designs. Similarly, supervised learning methods require extensive training on annotated datasets, limiting their adaptability to new tools. Despite their success in other domains, zero-shot pose estimation models remain unexplored in RMIS for pose estimation of surgical instruments, creating a gap in generalising to unseen surgical tools. This paper presents a novel 6 Degrees of Freedom (DoF) pose estimation pipeline for surgical instruments, leveraging state-of-the-art zero-shot RGB-D models such as FoundationPose and SAM-6D. We advance these models by incorporating vision-based depth estimation using the RAFT-Stereo method, yielding robust depth in reflective and textureless environments. Additionally, we enhance SAM-6D by replacing its instance segmentation module, the Segment Anything Model (SAM), with a fine-tuned Mask R-CNN, significantly boosting segmentation accuracy in occluded and complex conditions. Extensive validation reveals that our enhanced SAM-6D surpasses FoundationPose in zero-shot pose estimation of unseen surgical instruments, setting a new benchmark for zero-shot RGB-D pose estimation in RMIS. This work enhances the generalisability of pose estimation for unseen objects and pioneers the application of RGB-D zero-shot methods in RMIS.
|
|
10:20-10:25, Paper WeBT15.6 | |
Design and Effectiveness of Virtual Monitors and AR-Based Endoscope Control for Robotically Assisted Laparoscopic Surgery |
|
Budjakoski, Nikola | ImFusion GmbH |
Schneider, Dominik | German Aerospace Center (DLR) |
Song, Tianyu | Technical University of Munich |
Sommersperger, Michael | Technical University of Munich |
Weber, Bernhard | German Aerospace Center |
Navab, Nassir | TU Munich |
Klodmann, Julian | German Aerospace Center |
Keywords: Surgical Robotics: Laparoscopy, Virtual Reality and Interfaces
Abstract: Managing indirect access in laparoscopy as a minimally invasive procedure poses challenges to physicians. In particular, an endoscope must be navigated to achieve adequate visualization of the surgical anatomy, while coping with unergonomic poses, tremor, and fatigue. Furthermore, the alignment of visual perception and physical movement, dictated by the endoscope's position relative to the monitor, can lead to hand-eye coordination challenges. We propose unified deployment of a robotic endoscope holder together with an augmented reality display to counteract the aforementioned challenges in laparoscopy. Our augmented reality system provides an interactive, stereoscopic, virtual monitor displaying an endoscopic stream. In addition, our method design enables direct control of the robotic endoscope holder. Our user study demonstrates the potential of the proposed method to significantly improve hand-eye coordination, while insights from our usability study for robotic control indicate promising trends, including high usability and low cognitive demand.
|
|
10:25-10:30, Paper WeBT15.7 | |
MEDiC: Autonomous Surgical Robotic Assistance to Maximizing Exposure for Dissection and Cautery |
|
Liang, Xiao | University of California San Diego |
Wang, Chung-Pang | University of California, San Diego |
Shinde, Nikhil | University of California San Diego |
Liu, Fei | University of Tennessee Knoxville |
Richter, Florian | University of California, San Diego |
Yip, Michael C. | University of California, San Diego |
Keywords: Surgical Robotics: Laparoscopy, Surgical Robotics: Planning, Medical Robots and Systems
Abstract: Surgical automation has the capability to improve the consistency of patient outcomes and broaden access to advanced surgical care in underprivileged communities. Shared autonomy, where the robot automates routine subtasks while the surgeon retains partial teleoperative control, offers great potential to make an impact. In this paper we focus on one important skill within surgical shared autonomy: automating robotic assistance to maximize visual exposure and apply tissue tension for dissection and cautery. Ensuring consistent exposure to visualize the surgical site is crucial for both efficiency and patient safety. However, achieving this is highly challenging due to the complexities of manipulating the deformable volumetric tissues that are prevalent in surgery. To address these challenges we propose MEDiC, a framework for autonomous surgical robotic assistance that maximizes exposure for dissection and cautery. We integrate a differentiable physics model with perceptual feedback to achieve our two key objectives: 1) maximizing tissue exposure and applying tension for a specified dissection site through visual-servoing control, and 2) selecting optimal control positions for a dissection target based on deformable Jacobian analysis. We quantitatively assess our method through repeated real-robot experiments on a tissue phantom, and showcase its capabilities through dissection experiments using shared autonomy on real animal tissue.
|
|
WeBT16 |
404 |
Deformable Object Manipulation |
Regular Session |
Chair: Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engineering |
|
09:55-10:00, Paper WeBT16.1 | |
DeformPAM: Data-Efficient Learning for Long-Horizon Deformable Object Manipulation Via Preference-Based Action Alignment |
|
Chen, Wendi | Shanghai Jiao Tong University |
Xue, Han | Shanghai Jiao Tong University |
Zhou, Fangyuan | Shanghai Jiao Tong University |
Fang, Yuan | Shanghai Jiaotong University |
Lu, Cewu | ShangHai Jiao Tong University |
Keywords: Learning from Demonstration, Imitation Learning, Bimanual Manipulation
Abstract: In recent years, imitation learning has made progress in the field of robotic manipulation. However, it still faces challenges when dealing with complex long-horizon deformable object tasks, such as high-dimensional state spaces, complex dynamics, and multimodal action distributions. Traditional imitation learning methods often require a large amount of data and encounter distributional shifts and accumulative errors in these tasks. To address these issues, we propose a data-efficient general learning framework (DeformPAM) based on preference learning and reward-guided action selection. DeformPAM decomposes long-horizon tasks into multiple action primitives, utilizes 3D point cloud inputs and diffusion models to model action distributions, and trains an implicit reward model using human preference data. During the inference phase, the reward model scores multiple candidate actions, selecting the optimal action for execution, thereby reducing the occurrence of anomalous actions and improving task completion quality. Experiments conducted on three challenging real-world long-horizon deformable object manipulation tasks demonstrate the effectiveness of this method. Results show that DeformPAM improves both task completion quality and efficiency compared to baseline methods even with limited data. Code and data will be available at deform-pam.robotflow.ai.
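The reward-guided action selection step described above reduces to sampling several candidate actions from the learned policy, scoring each with the implicit reward model, and executing the best one. The sketch below illustrates only that selection logic, with toy stand-ins for the policy and reward model; all names are placeholders, not the DeformPAM code.

```python
# Minimal sketch of reward-guided action selection: sample candidates, score them
# with a learned reward model, execute the highest-scoring one.
import numpy as np

def sample_candidate_actions(policy, observation, num_candidates=16):
    """Draw several candidate action primitives from a (stochastic) policy."""
    return [policy(observation) for _ in range(num_candidates)]

def select_action(policy, reward_model, observation, num_candidates=16):
    """Score candidates with the learned reward model and return the best one."""
    candidates = sample_candidate_actions(policy, observation, num_candidates)
    scores = [reward_model(observation, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Toy stand-ins: a noisy policy over 2-D pick points and a reward preferring the center.
rng = np.random.default_rng(0)
policy = lambda obs: rng.uniform(-1.0, 1.0, size=2)
reward_model = lambda obs, a: -float(np.linalg.norm(a))
best = select_action(policy, reward_model, observation=None)
print("selected action:", best)
```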
|
|
10:00-10:05, Paper WeBT16.2 | |
Autonomous Bimanual Manipulation of Deformable Objects Using Deep Reinforcement Learning Guided Adaptive Control |
|
Liu, Jiayi | Huazhong University of Science and Technology |
Yang, Sihang | Huazhong University of Science and Technology |
Wang, Yiwei | Huazhong University of Science and Technology |
Zhao, Huan | Huazhong University of Science and Technology |
Ding, Han | Huazhong University of Science and Technology |
Keywords: Medical Robots and Systems, Surgical Robotics: Laparoscopy, Deep Learning in Grasping and Manipulation
Abstract: Deformable object manipulation (DOM), a common subtask in various surgical procedures, remains a challenge in robot-assisted surgery (RAS) due to complex nonlinear deformation. This paper proposes a model-free framework, deep reinforcement learning guided adaptive control (RLAC), which combines learning-based and Jacobian-based methods so that they complement each other for optimized performance. We harness samples from a deep reinforcement learning (DRL) policy explored in simulation to obtain a reasonable estimate of the initial deformation Jacobian. In early control iterations, the actions suggested by the DRL agent are adopted until the estimated real-time Jacobian approximates the actual deformation model. Subsequently, the independent Jacobian-based adaptive control (AC), now with sufficient initial deformation awareness, takes over to achieve precise internal feature manipulation on deformable objects. Experimental results demonstrate that our method enables more efficient positioning and exhibits near-optimal positioning paths. RLAC, with robust sim-to-real performance, provides a feasible approach to complex autonomous DOM in the real world.
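The Jacobian-based adaptive control stage described above can be illustrated with a Broyden-style rank-one update that refines a deformation Jacobian estimate as feature motions are observed. In the sketch below the initial estimate is random noise added to a ground-truth matrix; in the paper it comes from DRL rollouts, and all names are assumptions rather than the RLAC implementation.

```python
# Illustrative sketch: online refinement of a deformation Jacobian with a
# Broyden-style rank-one update, used inside a simple Jacobian-based servo loop.
import numpy as np

def broyden_update(J, dx, dq, eps=1e-9):
    """Rank-one correction so that the updated Jacobian maps dq to the observed dx."""
    denom = float(dq @ dq) + eps
    return J + np.outer(dx - J @ dq, dq) / denom

def servo_step(J, feature_error, gain=0.5):
    """Jacobian-based control: joint/gripper increment reducing the feature error."""
    return -gain * np.linalg.pinv(J) @ feature_error

rng = np.random.default_rng(1)
J_true = rng.standard_normal((2, 3))            # unknown "real" deformation model
J = J_true + 0.5 * rng.standard_normal((2, 3))  # rough initial estimate
x = np.array([1.0, -0.5])                       # feature error to drive to zero
for _ in range(30):
    dq = servo_step(J, x)
    dx = J_true @ dq                            # observed feature change (simulated)
    J = broyden_update(J, dx, dq)
    x = x + dx
print("remaining feature error norm:", np.linalg.norm(x))
```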
|
|
10:05-10:10, Paper WeBT16.3 | |
Embedded IPC: Fast and Intersection-Free Simulation in Reduced Subspace for Robot Manipulation |
|
Du, Wenxin | University of California, Los Angeles |
Yu, Chang | University of California, Los Angeles |
Ma, Siyu | University of California, Los Angeles |
Jiang, Ying | University of California, Los Angeles |
Zong, Zeshun | University of California, Los Angeles |
Yang, Yin | University of Utah |
Masterjohn, Joseph | Toyota Research Institute |
Castro, Alejandro | Toyota Research Institute |
Han, Xuchen | Toyota Research Institute |
Jiang, Chenfanfu | University of California, Los Angeles |
Keywords: Simulation and Animation, Contact Modeling
Abstract: Physics-based simulation is essential for developing and evaluating robot manipulation policies, particularly in scenarios involving deformable objects and complex contact interactions. However, existing simulators often struggle to balance computational efficiency with numerical accuracy, especially when modeling deformable materials with frictional contact constraints. We introduce an efficient subspace representation for the Incremental Potential Contact (IPC) method, leveraging model reduction to decrease the number of degrees of freedom. Our approach decouples simulation complexity from the resolution of the input model by representing elasticity in a low-resolution subspace while maintaining collision constraints on an embedded high-resolution surface. Our barrier formulation ensures intersection-free trajectories and configurations regardless of material stiffness, time step size, or contact severity. We validate our simulator through quantitative experiments with a soft bubble gripper grasping and qualitative demonstrations of placing a plate on a dish rack. The results demonstrate our simulator's efficiency, physical accuracy, computational stability, and robust handling of frictional contact, making it well-suited for generating demonstration data and evaluating downstream robot training applications.
|
|
10:10-10:15, Paper WeBT16.4 | |
A Highly Robust Contact Sensor for Precise Contact Detection of Fabric |
|
Ling, Zhengrong | The Hong Kong University of Science and Technology |
Hong, Lanxuan | Hkust |
Yang, Xiong | Hong Kong University of Science and Technology |
Tang, Yifeng | City University of Hong Kong |
Guo, Dong | City University of Hong Kong |
Shen, Yajing | The Hong Kong University of Science and Technology |
Keywords: Industrial Robots, Perception for Grasping and Manipulation, Contact Modeling
Abstract: Automation in the apparel and textile industry has long been pursued. However, accurately locating the surface of a fabric remains a challenge, limiting automation in sorting, packaging, and other processes. When humans locate clothing, they rely on contact feedback to determine the exact position of the clothing surface. As existing contact detection solutions are significantly affected by environmental factors, it is essential to develop a sensor with robust contact detection capabilities. In this work, we introduce a contact sensor with high robustness and high force resolution. The sensor detects contact by measuring the deformation of an elastomer using a distance-measuring module. Based on the deformation characteristics of the elastomer, we designed a detection algorithm that not only reduces data noise but also extracts features such as trends and elastomer states, enabling reliable contact detection. Through experiments, we validated that this contact sensor can detect contact forces as low as 0.017 N and is robust to external interference and sensor movement. We also verified that the sensor can process data within 7.5 ms and return a contact decision with 95% accuracy. Additionally, we assessed its effectiveness in real fabric contact scenarios.
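A trend-based contact test of the kind described above can be sketched by smoothing the distance signal and flagging contact once the short-term decrease exceeds a noise threshold. The window length, threshold, and synthetic signal below are illustrative assumptions, not the paper's parameters.

```python
# Rough sketch: contact detection from an elastomer deformation (distance) signal.
# Smooth the reading with a moving average, then flag contact when the short-term
# downward trend exceeds a noise threshold.
import numpy as np

def detect_contact(distance_mm, window=8, threshold_mm=0.02):
    """Return the first sample index where the smoothed deformation trend indicates contact."""
    d = np.convolve(distance_mm, np.ones(window) / window, mode="valid")  # moving average
    trend = d[:-window] - d[window:]          # decrease in distance over one window
    hits = np.where(trend > threshold_mm)[0]
    return int(hits[0] + window) if len(hits) else None

t = np.arange(400)
signal = 5.0 + np.random.normal(0, 0.005, t.size)
signal[250:] -= 0.01 * (t[250:] - 250)        # slow compression after contact at sample 250
print("contact detected at sample:", detect_contact(signal))
```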
|
|
10:15-10:20, Paper WeBT16.5 | |
Design, Modelling, and Experimental Verification of Passively Adaptable Roller Gripper for Separating Stacked Fabric |
|
Unde, Jayant | Nagoya University |
Colan, Jacinto | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: Grippers and Other End-Effectors, Grasping, Contact Modeling
Abstract: This study presents a novel approach to fabric manipulation through the development and optimization of a single-actuator-driven roller gripper. Focused on addressing the challenges inherent in handling fabrics with diverse thicknesses and materials, our gripper employs a passive adaptable mechanism driven by springs, enabling effective manipulation of fabrics ranging from 0.1mm to 2.25mm in thickness. We analyze gripper-fabric interaction forces to identify the parameters that influence successful grasping. We then optimize the gripper’s normal forces and the roller’s tangential force using the proposed model. Systematic evaluations demonstrated the gripper’s capability to separate individual layers from fabric stacks, achieving a 94.9% success rate across multiple fabric types. Overall, this research offers a compact, cost-effective solution with broad applicability in diverse industrial automation contexts, providing valuable insights for advancing robotic fabric handling systems. The gripper’s design is open-access and available for rapid development and customization at https://github.com/JayantUnde/Gripper.
|
|
10:20-10:25, Paper WeBT16.6 | |
Closed-Loop Shape Control of Deformable Linear Objects Based on Cosserat Model |
|
Artinian, Azad | ISIR - Sorbonne Université |
Ben Amar, Faiz | Université Pierre Et Marie Curie, Paris 6 |
Perdereau, Véronique | Sorbonne University |
Keywords: Dual Arm Manipulation, Visual Servoing, Modeling, Control, and Learning for Soft Robots
Abstract: Robotic shape control of deformable linear objects has garnered increasing interest within the robotics community. Despite recent progress, the majority of shape control approaches fall into two main groups: open-loop control, which relies on physically realistic models to represent the object, and closed-loop control, which employs less precise models alongside visual data to compute commands. In this work, we present a novel 3D shape control approach that incorporates the physically realistic Cosserat model into a closed-loop control framework, using vision feedback to rectify errors in real time. This approach capitalizes on the advantages of both groups: the realism and precision provided by physics-based models, and the rapid computation, real-time correction of model errors, and robustness to elastic parameter estimation inherent in vision-based approaches. This is achieved by computing a deformation Jacobian derived from both the Cosserat model and visual data. To demonstrate the effectiveness of the method, we conduct a series of shape control experiments in which robots are tasked with deforming linear objects towards a desired shape.
|
|
10:25-10:30, Paper WeBT16.7 | |
Single-Grasp Deformable Object Discrimination: The Effect of Gripper Morphology, Sensing Modalities, and Action Parameters |
|
Pliska, Michal | Czech Technical University in Prague, Faculty of Electrical Engineering |
Patni, Shubhan | Ceske Vysoke Uceni Technicke V Praze, FEL |
Mareš, Michal | Faculty of Electrical Engineering, Czech Technical University in Prague |
Stoudek, Pavel | Technology Innovation Institute (TII), Abu Dhabi |
Straka, Zdenek | Czech Technical University in Prague, Faculty of Electrical Engineering |
Stepanova, Karla | Czech Technical University |
Hoffmann, Matej | Czech Technical University in Prague, Faculty of Electrical Engineering |
Keywords: Grippers and Other End-Effectors, Force and Tactile Sensing, Recognition, Multifingered Hands
Abstract: In haptic object discrimination, the effect of gripper embodiment, action parameters, and sensory channels has not been systematically studied. We used two anthropomorphic hands and two 2-finger grippers to grasp two sets of deformable objects. On the object classification task, we found: (i) among classifiers, SVM on sensory features and LSTM on raw time series performed best across all grippers; (ii) faster compression speeds degraded performance; (iii) generalization to different grasping configurations was limited; transfer to different compression speeds worked well for the Barrett Hand only. Visualization of the feature spaces using PCA showed that the gripper morphology and the action parameters were the main source of variance, rendering generalization across embodiment or grasp configurations very hard. On the highly challenging dataset consisting of polyurethane foams alone, only the Barrett Hand achieved excellent performance. Tactile sensors can thus provide a key advantage even if recognition is based on stiffness rather than shape. The dataset with 24000 measurements is publicly available.
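The feature-plus-SVM route mentioned above can be sketched with scikit-learn: compute a few squeeze features from the force and position time series of a single grasp and fit an RBF-kernel SVM. The feature set and the synthetic two-class data are assumptions for illustration only, not the paper's pipeline.

```python
# Minimal sketch: SVM classification of deformable objects from hand-crafted
# single-grasp features (peak force, final force, and a stiffness proxy).
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def haptic_features(force_series, position_series):
    """A few simple squeeze features computed from one compression."""
    stiffness = (force_series[-1] - force_series[0]) / (position_series[0] - position_series[-1] + 1e-9)
    return [np.max(force_series), force_series[-1], stiffness]

rng = np.random.default_rng(0)
X, y = [], []
for label, k in [(0, 50.0), (1, 200.0)]:           # two synthetic foam stiffness classes
    for _ in range(40):
        pos = np.linspace(0.04, 0.02, 50)           # gripper closing by 2 cm
        force = k * (0.04 - pos) + rng.normal(0, 0.1, 50)
        X.append(haptic_features(force, pos))
        y.append(label)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(np.array(X), np.array(y))
print("training accuracy:", clf.score(np.array(X), np.array(y)))
```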
|
|
WeBT17 |
405 |
Soft Actuators 2 |
Regular Session |
Chair: Markvicka, Eric | University of Nebraska-Lincoln |
Co-Chair: Papadopoulos, Evangelos | National Technical University of Athens |
|
09:55-10:00, Paper WeBT17.1 | |
Introducing Mag-Nets: Rapidly Bending Electromagnetic Actuators for Self-Contained Soft Robots |
|
Bolanakis, Georgios | National Technical University of Athens |
Papadopoulos, Evangelos | National Technical University of Athens |
Keywords: Soft Sensors and Actuators, Modeling, Control, and Learning for Soft Robots, Soft Robot Materials and Design
Abstract: Present electromagnetic soft actuators rely on external magnetic fields or power supplies, while the very few that operate autonomously produce weak actuating forces, limiting their practicality. This work introduces a novel current-controlled electromagnetic actuator that employs copper coils and permanent magnets to produce substantial driving forces. The actuator can serve as a building block for independently controlled actuating networks to develop sophisticated self-contained soft robots and grippers. The design, inspired by fast pneu-net (fPN) actuators, ensures minimal bending resistance from the silicone body and thus allows high-speed bending motions. Two applications of the prototype actuator are studied: a two-fingered soft gripper realizing bending speeds of up to 1491°/s and a maximum grasping force of 1.19 N, and an entirely self-contained crawling soft robot utilizing friction anisotropy to generate forward locomotion. A lumped-element model is developed and validated experimentally to describe the dynamics of the gripper's soft finger. Pick-and-place tasks on various targets and tests on the crawling robot demonstrate the overall effectiveness of the developed actuator. The uniqueness of Mag-Nets, lying in their control simplicity, enhanced capability, and cost-effectiveness, sets the foundations for a new design approach for soft robots and grippers.
|
|
10:00-10:05, Paper WeBT17.2 | |
Miniature Dielectric Elastomer Actuator Probe Inspecting Confined Spaces Embedding a CMOS Sensor |
|
Sandhu, Sahib | University of Connecticut |
Li, Ang (Leo) | University of Toronto |
Tugui, Codrin | University of Connecticut |
Duduta, Mihai | University of Connecticut |
Keywords: Soft Robot Applications, Soft Robot Materials and Design, Soft Sensors and Actuators
Abstract: Navigating and inspecting confined space is crucial for the aerospace and healthcare industries. Exploring smaller and narrower spaces allows for problems to be identified earlier, preventing negative outcomes for patients and equipment. The challenge is to scale down the navigation probe while preserving degrees of freedom (DOF) and functionality. Dielectric elastomer actuators (DEAs) are promising probe candidates because they are solid-state, electrical-driven, and can be scaled down favorably. This work demonstrates a modular 2-DOF DEA miniature probe with an embedded CMOS sensor for visual data acquisition. The modularity achieved by a novel hinge system enables switching between single and dual DEA probes based on 2D or 3D pathway structures. The probes can be controlled using a pocket-sized circuit with two knobs to turn. We present the operating mechanism, device assembly, fabrication, and characterization of DEA bending actuators with widths below 2mm. In the end, we demonstrate the ability of devices to navigate through various complex and confined pathways.
|
|
10:05-10:10, Paper WeBT17.3 | |
Portable, High-Frequency, and High-Voltage Control Circuits for Untethered Miniature Robots Driven by Dielectric Elastomer Actuators |
|
Shao, Qi | Tsinghua University |
Liu, Xin-Jun | Tsinghua University |
Zhao, Huichan | Tsinghua University |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: In this work, we propose a high-voltage, high-frequency control circuit for untethered applications of dielectric elastomer actuators (DEAs). The circuit board leverages low-voltage resistive components connected in series to control voltages of up to 1.8 kV within a compact size, suitable for frequencies ranging from 0 to 1 kHz. A single-channel control board weighs only 2.5 g. We tested the performance of the control circuit under different load conditions and power supplies. Based on this control circuit, along with a commercial miniature high-voltage power converter, we construct an untethered crawling robot driven by a cylindrical DEA. The 42-g untethered robot successfully achieved crawling locomotion on a bench and within a pipeline at a driving frequency of 15 Hz, while simultaneously transmitting real-time video data via an onboard camera and antenna. Our work provides a practical way to use low-voltage control electronics to achieve untethered driving of DEAs, enabling portable and wearable devices.
|
|
10:10-10:15, Paper WeBT17.4 | |
Stretchable Electrohydraulic Artificial Muscle for Full Motion Ranges in Musculoskeletal Antagonistic Joints |
|
Kazemipour, Amirhossein | ETH Zürich |
Hinchet, Ronan | ETH Zurich |
Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Soft Sensors and Actuators, Soft Robot Materials and Design, Compliant Joints and Mechanisms
Abstract: Artificial muscles play a crucial role in musculoskeletal robotics and prosthetics to approximate the force-generating functionality of biological muscle. However, current artificial muscle systems are typically limited to either contraction or extension, not both. This limitation hinders the development of fully functional artificial musculoskeletal systems. We address this challenge by introducing an artificial antagonistic muscle system capable of both contraction and extension. Our design integrates non-stretchable electrohydraulic soft actuators (HASELs) with electrostatic clutches within an antagonistic musculoskeletal framework. This configuration enables an antagonistic joint to achieve a full range of motion without displacement loss due to tendon slack. We implement a synchronization method to coordinate muscle and clutch units, ensuring smooth motion profiles and speeds. This approach facilitates seamless transitions between antagonistic muscles at operational frequencies of up to 3.2 Hz. While our prototype utilizes electrohydraulic actuators, this muscle-clutch concept is adaptable to other non-stretchable artificial muscles, such as McKibben actuators, expanding their capability for extension and full range of motion in antagonistic setups. Our design represents a significant advancement in the development of fundamental components for more functional and efficient artificial musculoskeletal systems, bringing their capabilities closer to those of their biological counterparts.
|
|
10:15-10:20, Paper WeBT17.5 | |
Beyond Traversing in a Thin Pipe: Self-Sensing Odometry of a Pipeline Robot Driven by High-Frequency Dielectric Elastomer Actuators |
|
Cheng, Ran | Tsinghua University |
Shao, Qi | Tsinghua University |
Liu, Xin-Jun | Tsinghua University |
Zhao, Huichan | Tsinghua University |
Keywords: Soft Robot Applications, Soft Sensors and Actuators, Soft Robot Materials and Design
Abstract: In this paper, we propose an earthworm-inspired miniature pipeline robot capable of self-sensing odometry. The robot features a dielectric elastomer actuator (DEA) as its elongation body and two specially designed passive anchors to achieve unidirectional motion without slipping. Odometry is achieved through the self-sensing scheme of DEAs by summing all step sizes over a period. The careful implementation of the self-sensing method resulted in a small sensing resolution of 0.05 mm at a high actuation frequency of 20 Hz for a cylindrical DEA. Finally, the robot achieved self-sensing odometry in a pipe, showing good consistency with the ground truth. This work paves a new way for miniature in-pipe robots to sense their own state without additional sensors, saving space and power.
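A minimal sketch of the step-summation odometry described above; the linear capacitance-to-elongation mapping, its constants, and the pre-segmented actuation cycles are hypothetical placeholders, not the paper's calibration or self-sensing scheme.

```python
import numpy as np

def capacitance_to_elongation(c, c0=1.0, k=0.5):
    # Hypothetical linear mapping from measured DEA capacitance to elongation (mm);
    # a real self-sensing scheme would calibrate this relation for the actuator.
    return k * (np.asarray(c, dtype=float) - c0)

def self_sensing_odometry(capacitance_trace, cycle_indices):
    """Accumulate per-cycle step sizes into a travelled-distance estimate.

    capacitance_trace: 1-D array of self-sensed capacitance samples.
    cycle_indices: list of (start, end) sample indices, one per actuation cycle.
    """
    distance = 0.0
    for start, end in cycle_indices:
        elong = capacitance_to_elongation(capacitance_trace[start:end])
        # Step size of one earthworm-like cycle: elongation excursion in that cycle.
        distance += float(elong.max() - elong.min())
    return distance
```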
|
|
10:20-10:25, Paper WeBT17.6 | |
Intelligent Self-Healing Artificial Muscle: Mechanisms for Damage Detection and Autonomous Repair of Puncture Damage in Soft Robotics |
|
Krings, Ethan | University of Nebraska-Lincoln |
McManigal, Patrick | University of Nebraska-Lincoln |
Markvicka, Eric | University of Nebraska-Lincoln |
Keywords: Soft Robot Materials and Design, Soft Sensors and Actuators, Soft Robot Applications
Abstract: Soft robots are characterized by their high deformability, mechanical robustness, and inherent resistance to damage. These unique properties present exciting new opportunities to enhance both emerging and existing fields such as healthcare, manufacturing, and exploration. However, to function effectively in unstructured environments, these technologies must be able to withstand the same real-world conditions that human skin and other soft biological materials are typically subjected to. Here, we present a novel soft material architecture designed for active detection of material damage and autonomous repair in soft robotic actuators. By integrating liquid metal (LM) microdroplets within a silicone elastomer, the system can detect and localize damage through the formation of conductive pathways that arise from extreme pressure or puncture events. These newly formed conductive networks function as in situ Joule heating elements, facilitating the reprocessing and healing of the material. The architecture allows for the reconfiguration of the newly formed electrical network using high current densities, employing electromigration and thermal mechanisms to restore functionality without manual intervention. This innovative approach not only enhances the resilience and performance of soft materials but also supports a wide range of applications in soft robotics and wearable technologies, where adaptive and autonomous systems are crucial for operation in dynamic and unpredictable environments.
|
|
10:25-10:30, Paper WeBT17.7 | |
High-Force Electroadhesion Based on Unique Liquid-Solid Dielectrics for UAV Perching |
|
Luo, Junjie | The Chinese University of Hong Kong |
Li, Jisen | Shenzhen Institute of Artificial Intelligence and Robotics for S |
Wang, Hongqiang | Southern University of Science and Technology |
Zhu, Jian | Chinese University of Hong Kong, Shenzhen |
Keywords: Soft Sensors and Actuators, Soft Robot Applications, Soft Robot Materials and Design
Abstract: Electroadhesion (EA), as an electrostatically driven, controllable adhesion technology, has unique attributes such as low noise, robust adaptability, and energy efficiency. However, its adhesion pressure is still low (0.1–10 kPa), which may significantly limit its applications. This paper presents an innovative electroadhesion pad embedded with liquid and solid dielectrics. The experiments demonstrate that this liquid-solid electroadhesion pad (LSEAP) is capable of much larger adhesion pressure compared to the traditional solid electroadhesion pad (SEAP). On one hand, the LSEAP can increase the dielectric contact with the substrate. On the other hand, the actuator can increase its dielectric strength. We also explore the application of this actuator to the perching of a commercial Unmanned Aerial Vehicle (UAV), in order to extend the UAV's operational endurance. Notably, the untethered LSEAP system, with an adhesion area as small as 4 cm² and a self-weight as light as 8.7 g, can support a 249.7-g UAV for stable adhesion on various surfaces. The adhesion pressure generated by our LSEAP can reach 32.2 kPa, significantly larger than values reported in the literature. The weight ratio of the UAV to the LSEAP system is 14.6, more than double that of previous studies. The integration of this EA system markedly prolongs the operational duration of UAVs, rendering them suitable for sustainable surveillance and reconnaissance missions. This LSEAP also marks a pivotal advancement towards adhesion-based applications such as grippers and wall-climbing robots.
|
|
WeBT18 |
406 |
Intelligent Transportation Systems and AI-Based Methods |
Regular Session |
Co-Chair: Rosman, Guy | Massachusetts Institute of Technology |
|
09:55-10:00, Paper WeBT18.1 | |
Multi-Scale Convolutional Networks with Class-Normalized Logit Clipping for Robust Sea State Estimation from Noisy Ship Motion Data |
|
Qin, Xin | Tianjin University of Technology |
Liu, Mengna | Tianjin University of Technology |
Cheng, Xu | Smart Innovation Norway |
Liu, Xiufeng | Technical University of Denmark |
Shi, Fan | Tianjin University of Technology |
Zhang, Jianhua | Tianjin University of Technology |
Chen, Shengyong | Tianjin University of Technology |
Keywords: Intelligent Transportation Systems, Deep Learning Methods, Big Data in Robotics and Automation
Abstract: Autonomous ships utilize automation systems to achieve unmanned navigation, driving innovation in maritime transportation. However, sea conditions are influenced by dynamic factors such as wave height, wind speed, and ocean currents, making them challenging to assess accurately. Traditional classification models often assume accurate labels, but noisy labels are prevalent in real-world applications. Existing methods, such as noise-sample filtering or loss-function adjustment, have limited applicability and poor generalization when dealing with complex sea condition data. To address this issue, this study proposes an end-to-end neural network model. The model's feature extraction module uses deep representation learning to capture latent patterns in the data, and a loss function is designed to mitigate the impact of outliers. The integration of these components allows the model to perform accurate classification even in the presence of noisy labels. Extensive experiments on public and sea condition datasets validate the effectiveness of this approach, demonstrating that the model exhibits strong generalization capabilities and holds great promise for practical applications.
|
|
10:00-10:05, Paper WeBT18.2 | |
Directed-CP: Directed Collaborative Perception for Connected and Autonomous Vehicles Via Proactive Attention |
|
Tao, Yihang | City University of Hong Kong |
Hu, Senkang | City University of Hong Kong |
Fang, Zhengru | City University of Hong Kong |
Fang, Yuguang | City University of Hong Kong |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Cooperating Robots
Abstract: Collaborative perception (CP) leverages visual data from connected and autonomous vehicles (CAV) to expand an ego vehicle’s field of view (FoV). Despite recent progress, current CP methods expand the ego vehicle’s 360-degree perceptual range almost uniformly, but face two key challenges. Firstly, in areas with uneven traffic distribution, focusing on directions with little traffic offers limited benefits. Secondly, under limited communication budgets, allocating excessive bandwidth to less critical directions lowers the perception accuracy in more vital areas. To address these issues, we propose Directed-CP, a proactive and direction-aware CP system aiming at improving CP in specific directions. Our key idea is to enable an ego vehicle to proactively signal its directions of interest and readjust its attention to enhance local directional CP performance. To achieve this, we first propose an RSU-aided direction masking mechanism that assists an ego vehicle in identifying vital directions. Additionally, we design a direction-aware selective attention module to wisely aggregate pertinent features based on the ego vehicle’s directional priorities, communication budget, and the positional data of CAVs. Moreover, we introduce a direction-weighted detection loss (DWLoss) to capture the divergence between directional CP outcomes and the ground truth, facilitating effective model training. Extensive experiments on the V2X-Sim 2.0 dataset demonstrate that our approach achieves 19.8% higher local perception accuracy in interested directions and 2.5% higher overall perception accuracy than the state-of-the-art methods in collaborative 3D object detection tasks.
|
|
10:05-10:10, Paper WeBT18.3 | |
Motion Forecasting Via Model-Based Risk Minimization |
|
Distelzweig, Aron | Albert-Ludwigs-Universität Freiburg |
Kosman, Eitan | Bosch |
Andreas, Look | Bosch |
Janjoš, Faris | Robert Bosch GmbH |
Manivannan, Denesh Kumar | TU Delft |
Valada, Abhinav | University of Freiburg |
Keywords: Intelligent Transportation Systems, AI-Based Methods, Behavior-Based Systems
Abstract: Forecasting the future trajectories of surrounding agents is crucial for autonomous vehicles to ensure safe, efficient, and comfortable route planning. While model ensembling has improved prediction accuracy in various fields, its application in trajectory prediction is limited due to the multi-modal nature of predictions. In this paper, we propose a novel sampling method applicable to trajectory prediction based on the predictions of multiple models. We first show that conventional sampling based on predicted probabilities can degrade performance due to missing alignment between models. To address this problem, we introduce a new method that generates optimal trajectories from a set of neural networks, framing it as a risk minimization problem with a variable loss function. By using state-of-the-art models as base learners, our approach constructs diverse and effective ensembles for optimal trajectory sampling. Extensive experiments on the nuScenes prediction dataset demonstrate that our method surpasses current state-of-the-art techniques, achieving top ranks on the leaderboard. We also provide a comprehensive empirical study on ensembling strategies, offering insights into their effectiveness. Our findings highlight the potential of advanced ensembling techniques in trajectory prediction, significantly improving predictive performance and paving the way for more reliable predicted trajectories.
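A minimal sketch of risk-minimizing trajectory selection from an ensemble, assuming candidate modes and probabilities pooled from all members; the distance surrogate and the greedy selection below are illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np

def risk_minimizing_selection(candidates, probs, k=6):
    """Greedily pick k trajectories that minimize expected distance to the
    ensemble's pooled predictions (a simple surrogate for the risk objective).

    candidates: (N, T, 2) trajectory modes pooled from all ensemble members.
    probs: (N,) corresponding mode probabilities, normalized to sum to 1.
    """
    selected = []
    for _ in range(k):
        risks = []
        for traj in candidates:
            # Expected loss of committing to `traj` when the pooled modes are
            # treated as weighted samples of the true future.
            dists = np.mean(np.linalg.norm(candidates - traj, axis=-1), axis=-1)
            risks.append(np.sum(probs * dists))
        best = int(np.argmin(risks))
        selected.append(candidates[best])
        # Drop the chosen mode so subsequent picks cover other regions.
        candidates = np.delete(candidates, best, axis=0)
        probs = np.delete(probs, best)
        probs = probs / probs.sum()
    return np.stack(selected)
```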
|
|
10:10-10:15, Paper WeBT18.4 | |
Computational Teaching for Driving Via Multi-Task Imitation Learning |
|
Edakkattil Gopinath, Deepak | Toyota Research Institute |
Cui, Xiongyi | Toyota Research Institute |
DeCastro, Jonathan | Cornell University |
Sumner, Emily | Toyota Research Institute |
Costa, Jean | Toyota Research Institute |
Yasuda, Hiroshi | Toyota Research Institute |
Morgan, Allison | Toyota Research Institute |
Dees, Laporsha | Toyota Research Institute |
Chau, Sheryl | Toyota Research Institute |
Leonard, John | MIT |
Chen, Tiffany | Toyota Research Institute |
Rosman, Guy | Massachusetts Institute of Technology |
Balachandran, Avinash | Toyota Research Institute |
Keywords: Human Performance Augmentation, Imitation Learning, Intelligent Transportation Systems
Abstract: Learning motor skills for sports or performance driving is often done with professional instruction from expert human teachers, whose availability is limited. Our goal is to enable automated teaching via a learned model that interacts with the student similarly to a human teacher. However, training such automated teaching systems is limited by the availability of high-quality annotated datasets of expert teacher and student interactions, as they are difficult to collect at scale. To address this data scarcity problem, we propose an approach for training a coaching system for complex motor tasks such as high-performance driving via a Multi-Task Imitation Learning (MTIL) paradigm. MTIL allows our model to learn robust representations by utilizing self-supervised training signals from more readily available non-interactive datasets of humans performing the task of interest. We validate our approach with (1) a semi-synthetic dataset created from real human driving trajectories, (2) a professional track driving instruction dataset, (3) a track racing driving simulator human-subject study, and (4) a system demonstration on an instrumented car at a race track. Our experiments show that the right set of auxiliary machine learning tasks improves prediction of teaching instructions. Moreover, in the human subjects study, students exposed to the instructions from our teaching system improve their ability to stay within track limits, and show favorable perception of the model’s interaction with them, in terms of usefulness and satisfaction.
|
|
10:15-10:20, Paper WeBT18.5 | |
A Comprehensive LLM-Powered Framework for Driving Intelligence Evaluation |
|
You, Shanhe | Institute for AI Industry Research, Tsinghua University |
Luo, Xuewen | Monash University |
Liang, Xinhe | National University of Singapore |
Yu, Jiashu | Tsinghua University |
Zheng, Chen | Institute for AI Industry Research, Tsinghua University |
Gong, Jiangtao | Tsinghua University |
Keywords: Human-Centered Automation, Intelligent Transportation Systems, Performance Evaluation and Benchmarking
Abstract: Evaluation methods for autonomous driving are crucial for algorithm optimization. However, due to the complexity of driving intelligence, there is currently no comprehensive evaluation method for the level of autonomous driving intelligence. In this paper, we propose an evaluation framework for driving behavior intelligence in complex traffic environments, aiming to fill this gap. We constructed a natural language evaluation dataset of human professional drivers and passengers through naturalistic driving experiments and post-driving behavior evaluation interviews. Based on this dataset, we developed an LLM-powered driving evaluation framework. The effectiveness of this framework was validated through simulated experiments in the CARLA urban traffic simulator and further corroborated by human assessment. Our research provides valuable insights for evaluating and designing more intelligent, human-like autonomous driving agents. The implementation details of the framework and detailed information about the dataset can be found on GitHub.
|
|
10:20-10:25, Paper WeBT18.6 | |
LoRD: Adapting Differentiable Driving Policies to Distribution Shifts |
|
Diehl, Christopher | TU Dortmund University |
Karkus, Peter | NVIDIA |
Veer, Sushant | NVIDIA |
Pavone, Marco | Stanford University |
Bertram, Torsten | Technische Universität Dortmund |
Keywords: Intelligent Transportation Systems, Integrated Planning and Learning, Transfer Learning
Abstract: Distribution shifts between operational domains can severely affect the performance of learned models in self-driving vehicles (SDVs). While this is a well-established problem, prior work has mostly explored naive solutions such as fine-tuning, focusing on the motion prediction task. In this work, we explore novel adaptation strategies for differentiable autonomy stacks (structured policy) consisting of prediction, planning, and control, perform evaluation in closed-loop, and investigate the often-overlooked issue of catastrophic forgetting. Specifically, we introduce two simple yet effective techniques: a low-rank residual decoder (LoRD) and multi-task fine-tuning. Through experiments across three models conducted on two real-world autonomous driving datasets (nuPlan, exiD), we demonstrate the effectiveness of our methods and highlight a significant performance gap between open-loop and closed-loop evaluation in prior approaches. Our approach reduces forgetting by up to 23.33% and improves the closed-loop OOD driving score by 9.93% in comparison to standard fine-tuning.
|
|
10:25-10:30, Paper WeBT18.7 | |
BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes |
|
Kulathun Mudiyanselage, Kasun Weerakoon | University of Maryland, College Park |
Elnoor, Mohamed | University of Maryland |
Seneviratne, Gershom Devake | University of Maryland, College Park |
Rajagopal, Vignesh | University of Maryland, College Park |
Arul, Senthil Hariharan | University of Maryland, College Park |
Liang, Jing | University of Maryland |
M Jaffar, Mohamed Khalid | University of Maryland, College Park |
Manocha, Dinesh | University of Maryland |
Keywords: Perception-Action Coupling, AI-Based Methods, Motion and Path Planning
Abstract: We present BehAV, a novel approach for autonomous robot navigation in outdoor scenes guided by human instructions and leveraging Vision Language Models (VLMs). Our method interprets human commands using a Large Language Model (LLM) and categorizes the instructions into navigation and behavioral guidelines. Navigation guidelines consist of directional commands (e.g., "move forward until") and associated landmarks (e.g., "the building with blue windows"), while behavioral guidelines encompass regulatory actions (e.g., "stay on") and their corresponding objects (e.g., "pavements"). We use VLMs for their zero-shot scene understanding capabilities to estimate landmark locations from RGB images for robot navigation. Further, we introduce a novel scene representation that utilizes VLMs to ground behavioral rules into a behavioral cost map. This cost map encodes the presence of behavioral objects within the scene and assigns costs based on their regulatory actions. The behavioral cost map is integrated with a LiDAR-based occupancy map for navigation. To navigate outdoor scenes while adhering to the instructed behaviors, we present an unconstrained Model Predictive Control (MPC)-based planner that prioritizes both reaching landmarks and following behavioral guidelines. We evaluate the performance of BehAV on a quadruped robot across diverse real-world scenarios, demonstrating a 22.49% improvement in alignment with human-teleoperated actions, as measured by Fréchet distance, and achieving a 40% higher navigation success rate compared to state-of-the-art methods.
|
|
WeBT19 |
407 |
State Estimation |
Regular Session |
Co-Chair: Xiong, Xiaobin | University of Wisconsin Madison |
|
09:55-10:00, Paper WeBT19.1 | |
An Adaptive Graduated Nonconvexity Loss Function for Robust Nonlinear Least Squares Solutions |
|
Jung, Kyungmin | McGill University |
Hitchcox, Thomas | McGill University |
Forbes, James Richard | McGill University |
Keywords: Graduated nonconvexity, Robust/Adaptive Control of Robotic Systems, SLAM, Learning and Adaptive Systems
Abstract: Many problems in robotics, such as estimating the state from noisy sensor data or aligning two point clouds, can be posed and solved as least-squares problems. Unfortunately, vanilla nonminimal solvers for least-squares problems are notoriously sensitive to outliers and initialization errors. The conventional approach to outlier rejection is to use a robust loss function, which is typically selected and tuned a priori. A newly developed approach to handle large initialization errors is graduated nonconvexity (GNC), which is defined for a particular choice of a robust loss function. The main contribution of this paper is to combine these two approaches by using an adaptive kernel within a GNC optimization scheme. This produces least-squares problems that are robust to both outliers and initialization errors, without the need for model selection and tuning. Simulations and experiments demonstrate that the proposed method is more robust compared to non-GNC counterparts and performs on par with other GNC-tailored loss functions. Example code can be found at https://github.com/decargroup/gnc-adapt.
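For context, a minimal sketch of graduated nonconvexity with Geman-McClure weights on a linear least-squares problem; the paper's adaptive-kernel selection is not reproduced, and sigma, the initial mu, and the annealing schedule below are placeholder choices.

```python
import numpy as np

def gnc_geman_mcclure_irls(A, b, sigma=1.0, mu0=1e3, mu_factor=1.4, iters=50):
    """Graduated nonconvexity with Geman-McClure weights for least squares A x ~ b.

    The surrogate cost starts nearly convex (large mu) and is gradually tightened
    toward the original nonconvex robust cost as mu -> 1.
    """
    x = np.linalg.lstsq(A, b, rcond=None)[0]  # vanilla least-squares initialization
    mu = mu0
    for _ in range(iters):
        r = A @ x - b
        # GNC Geman-McClure weights: w = (mu * sigma^2 / (r^2 + mu * sigma^2))^2
        w = (mu * sigma**2 / (r**2 + mu * sigma**2)) ** 2
        W = np.diag(w)
        x = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)  # weighted least-squares step
        mu = max(1.0, mu / mu_factor)  # anneal toward the true robust cost
    return x
```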
|
|
10:00-10:05, Paper WeBT19.2 | |
Learning Direct Solutions in Moving Horizon Estimation with Deep Learning Methods |
|
Lionti, Fabien | INRIA |
Gutowski, Nicolas | University of Angers, LERIA |
Aubin, Sébastien | DGA |
Martinet, Philippe | INRIA |
Keywords: Deep Learning Methods, Optimization and Optimal Control
Abstract: State estimation for dynamical systems is crucial for various applications, including control and monitoring. Moving Horizon Estimation (MHE) is an optimization-based state estimation algorithm that leverages a known dynamical model integrated over a moving horizon. The MHE optimization criterion corresponds to identifying the initial state that best aligns the integrated trajectory with the system observations. In the MHE setting, state estimation performance increases with the length of the moving horizon, but the optimization can become computationally intensive, which limits its applicability to fast-varying dynamical systems or to hardware with restricted computational power. Deep Learning (DL) methods can learn solutions to complex optimization problems without incurring any additional online computational cost beyond the inference of the considered architecture. In the context of state estimation, we propose to study different types of DL architectures in order to provide full state estimation from partial and noisy system observations. The proposed method is based on an end-to-end differentiable formulation of the MHE optimization problem, enabling the offline training of a DL model to provide a state estimate that minimizes the MHE optimization criterion. Once training is completed, state estimates are generated through an explicit relationship learned by the DL model. The proposed method is compared to the online MHE formulation in various case studies, including scenarios with partially observed states and model discrepancies in the context of lateral vehicle dynamics. The results highlight improved state estimation performance, both in computational time and accuracy, with respect to the online MHE algorithm.
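A minimal sketch of the idea of learning a direct MHE solution, assuming a differentiable dynamics model f and observation model h (both placeholders) and a simple feed-forward network; the paper's architectures and training details are not reproduced.

```python
import torch
import torch.nn as nn

class MHESurrogate(nn.Module):
    """Maps a window of partial, noisy observations to the full initial state."""
    def __init__(self, obs_dim, horizon, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, y_window):          # y_window: (batch, horizon, obs_dim)
        return self.net(y_window.flatten(start_dim=1))

def mhe_training_loss(model, f, h, y_window):
    """Differentiable MHE-style criterion: the estimated initial state should yield
    a simulated trajectory whose predicted observations match the window."""
    batch, horizon, _ = y_window.shape
    x = model(y_window)
    residuals = []
    for t in range(horizon):
        residuals.append(h(x) - y_window[:, t])  # observation mismatch at step t
        x = f(x)                                  # differentiable dynamics step
    return torch.stack(residuals, dim=1).pow(2).mean()
```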
|
|
10:05-10:10, Paper WeBT19.3 | |
A Data-Driven Contact Estimation Method for Wheeled-Biped Robots |
|
Gökbakan, Umit Bora | Inria |
Dümbgen, Frederike | ENS, PSL University |
Caron, Stephane | Inria |
Keywords: Contact Modeling, Legged Robots, Probabilistic Inference
Abstract: Contact estimation is a key ability for limbed robots, where making and breaking contacts has a direct impact on state estimation and balance control. Existing approaches typically rely on gait-cycle priors or designated contact sensors. We design a contact estimator that is suitable for the emerging wheeled-biped robot types that do not have these features. To this end, we propose a Bayes filter in which update steps are learned from real-robot torque measurements while prediction steps rely on inertial measurements. We evaluate this approach in extensive real-robot and simulation experiments. Our method achieves better performance while being considerably more sample efficient than a comparable deep-learning baseline.
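A minimal sketch of a binary-contact Bayes filter in this spirit; the learned measurement-likelihood model and the IMU-based transition model are placeholder callables, not the paper's trained components.

```python
def bayes_filter_contact_step(p_contact, torque_feat, imu_feat,
                              likelihood_model, transition_model):
    """One step of a binary-contact Bayes filter.

    p_contact: prior probability of contact from the previous step.
    likelihood_model(torque_feat) -> (p(z | contact), p(z | no contact)),
        e.g. a classifier trained on real-robot torque measurements.
    transition_model(imu_feat) -> (p(stay in contact), p(gain contact)),
        e.g. derived from inertial measurements.
    """
    p_stay, p_gain = transition_model(imu_feat)
    # Prediction: propagate the belief through the contact transition model.
    prior = p_contact * p_stay + (1.0 - p_contact) * p_gain
    # Update: reweight by the learned measurement likelihoods and normalize.
    l_c, l_nc = likelihood_model(torque_feat)
    posterior = l_c * prior / (l_c * prior + l_nc * (1.0 - prior) + 1e-12)
    return posterior
```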
|
|
10:10-10:15, Paper WeBT19.4 | |
Simultaneous Ground Reaction Force and State Estimation Via Constrained Moving Horizon Estimation |
|
Kang, Jiarong | University of Wisconsin Madison |
Xiong, Xiaobin | University of Wisconsin Madison |
Keywords: Sensor Fusion, Legged Robots, Humanoid and Bipedal Locomotion
Abstract: Accurate ground reaction force (GRF) estimation can significantly improve the adaptability of legged robots in various real-world applications. For instance, with estimated GRF and contact kinematics, locomotion control and planning can assist the robot in overcoming uncertain terrain. The canonical momentum-based methods, formulated as nonlinear observers, do not fully address the noisy measurements and the dependence between floating-base states and the generalized momentum dynamics. In this paper, we present a simultaneous ground reaction force and state estimation framework for legged robots, which systematically addresses the sensor noise and the coupling between states and dynamics. With the floating-base orientation estimated separately, a decentralized Moving Horizon Estimation (MHE) method is implemented to fuse the robot dynamics, proprioceptive sensors, exteroceptive sensors, and deterministic contact complementarity constraints in a convex windowed optimization. The proposed method is shown to be capable of providing accurate GRF and state estimation on several legged robots, including the custom-designed humanoid robot Bucky, the open-source educational planar bipedal robot STRIDE, and the quadrupedal robot Unitree Go1, at a frequency of 200 Hz with a past time window of 0.04 s.
|
|
10:15-10:20, Paper WeBT19.5 | |
FracGM: A Fast Fractional Programming Technique for Geman-McClure Robust Estimator |
|
Chen, Bang-Shien | National Taiwan Normal University |
Lin, Yu-Kai | MediaTek Inc |
Chen, Jian-Yu | National Central University |
Huang, Chih-Wei | National Central University |
Chern, Jann-Long | National Taiwan Normal University |
Sun, Ching-Cherng | National Central University |
Keywords: Optimization and Optimal Control, Mapping
Abstract: Robust estimation is essential in computer vision, robotics, and navigation, aiming to minimize the impact of outlier measurements for improved accuracy. We present a fast algorithm for Geman-McClure robust estimation, FracGM, leveraging fractional programming techniques. This solver reformulates the original non-convex fractional problem into a convex dual problem and a linear equation system, solving them iteratively in an alternating optimization pattern. Compared to graduated non-convexity approaches, this strategy exhibits a faster convergence rate and better outlier rejection capability. In addition, the global optimality of the proposed solver can be guaranteed under given conditions. We demonstrate the proposed FracGM solver on Wahba's rotation problem and 3-D point-cloud registration, along with relaxation pre-processing and projection post-processing. Compared to state-of-the-art algorithms, as the outlier rate increases from 20% to 80%, FracGM shows 53% and 88% smaller increases in rotation and translation errors, respectively. In real-world scenarios, FracGM achieves better results in 13 out of 18 outcomes, while achieving a 19.43% improvement in computation time.
|
|
10:20-10:25, Paper WeBT19.6 | |
Equivariant IMU Preintegration with Biases: A Galilean Group Approach |
|
Delama, Giulio | University of Klagenfurt |
Fornasier, Alessandro | University of Klagenfurt |
Mahony, Robert | Australian National University |
Weiss, Stephan | Universität Klagenfurt |
Keywords: Localization, Sensor Fusion, Visual-Inertial SLAM
Abstract: This letter proposes a new approach for Inertial Measurement Unit (IMU) preintegration, a fundamental building block that can be leveraged in different optimization-based Inertial Navigation System (INS) localization solutions. Inspired by recent advances in equivariant theory applied to biased INSs, we derive a discrete-time formulation of the IMU preintegration on Gal(3) ⋉ gal(3), the left-trivialization of the tangent group of the Galilean group Gal(3). We define a novel preintegration error that geometrically couples the navigation states and the bias leading to lower linearization error. Our method improves in consistency compared to existing preintegration approaches which treat IMU biases as a separate state-space. Extensive validation against state-of-the-art methods, both in simulation and with real-world IMU data, implementation in the Lie++ library, and open-source code are provided.
|
|
10:25-10:30, Paper WeBT19.7 | |
State Estimation for Continuum Multi-Robot Systems on SE(3) |
|
Lilge, Sven | University of Toronto |
Barfoot, Timothy | University of Toronto |
Burgner-Kahrs, Jessica | University of Toronto |
Keywords: Flexible Robots, State Estimation, Sensor Fusion, Parallel Robots
Abstract: In contrast to conventional robots, accurately modeling the kinematics and statics of continuum robots is challenging due to partially unknown material properties, parasitic effects, or unknown forces acting on the continuous body. Consequently, state estimation approaches that utilize additional sensor information to predict the shape of continuum robots have garnered significant interest. This paper presents a novel approach to state estimation for systems with multiple coupled continuum robots, which allows estimating the shape and strain variables of multiple continuum robots in an arbitrary coupled topology. Simulations and experiments demonstrate the capabilities and versatility of the proposed method, achieving accurate and continuous estimates of the system state, with average end-effector errors of 3.3 mm and 5.02° depending on the sensor setup. It is further shown that the approach offers fast computation times of below 10 ms, enabling its use in quasi-static real-time scenarios with average update rates of 100-200 Hz. An open-source C++ implementation of the proposed state estimation method is made publicly available to the community.
|
|
WeBT20 |
408 |
Agricultural Automation 1 |
Regular Session |
Chair: Jiang, Yu | Cornell University |
Co-Chair: Carpin, Stefano | University of California, Merced |
|
09:55-10:00, Paper WeBT20.1 | |
IMU Augment Tightly Coupled Lidar-Visual-Inertial Odometry for Agricultural Environments |
|
Hoang, Quoc Hung | Chungbuk National University |
Kim, Gon-Woo | Chungbuk National University |
Keywords: Agricultural Automation, SLAM, Robotics and Automation in Agriculture and Forestry
Abstract: This paper presents a new tightly coupled LiDAR-visual odometry scheme for autonomous agricultural machinery operating in structureless environments and in the presence of fluctuating uncertainties. A robust adaptive filter is proposed to significantly mitigate the effects of unknown disturbances and noise. Meanwhile, the IMU orientation is effectively estimated with an error-state Kalman filter (ESKF). The IMU attitude estimate is integrated to significantly improve the accuracy of both LiDAR and visual odometry. As a result, the proposed approach achieves excellent output performance, smooth trajectories, and robustness against uncertainties. Finally, the effectiveness of the proposed LiDAR-visual odometry is confirmed through real-time experiments in different scenarios.
|
|
10:00-10:05, Paper WeBT20.2 | |
Joint 3D Point Cloud Segmentation Using Real-Sim Loop: From Panels to Trees and Branches |
|
Qiu, Tian | Cornell University |
Du, Ruiming | Cornell University |
Spine, Nikolai | Cornell University |
Cheng, Lailiang | Cornell University |
Jiang, Yu | Cornell University |
Keywords: Robotics and Automation in Agriculture and Forestry, Field Robots, Data Sets for Robotic Vision
Abstract: Modern orchards are planted in structured rows with distinct panel divisions to improve management. Accurate and efficient joint segmentation of point cloud from Panel to Tree and Branch (P2TB) is essential for robotic operations. However, most current segmentation methods focus on single-instance segmentation and depend on a sequence of deep networks to perform joint tasks. This strategy hinders the use of hierarchical information embedded in the data, leading to both error accumulation and increased costs for annotation and computation, which limits its scalability for real-world applications. In this study, we proposed a novel approach that incorporated a Real2Sim L-TreeGen for training data generation and a joint model (J-P2TB) designed for the P2TB task. The J-P2TB model, trained on the generated simulation dataset, was used for joint segmentation of real-world panel point clouds via zero-shot learning. Compared to representative methods, our model outperformed them in most segmentation metrics while using 40% fewer learnable parameters. This Sim2Real result highlighted the efficacy of L-TreeGen in model training and the performance of J-P2TB for joint segmentation, demonstrating its strong accuracy, efficiency, and generalizability for real-world applications. These improvements would not only greatly benefit the development of robots for automated orchard operations but also advance digital twin technology, enabling the facilitation of field robotics across various domains.
|
|
10:05-10:10, Paper WeBT20.3 | |
Energy Efficient Planning for Repetitive Heterogeneous Tasks in Precision Agriculture |
|
Xie, Shuangyu | Texas A&M University |
Goldberg, Ken | UC Berkeley |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Keywords: Task Planning, Agricultural Automation, Robotics and Automation in Agriculture and Forestry
Abstract: Robotic weed removal in precision agriculture introduces a repetitive heterogeneous task planning (RHTP) challenge for a mobile manipulator. RHTP has two unique characteristics: 1) an observe-first-and-manipulate-later (OFML) temporal constraint that forces a unique ordering of two different tasks for each target and 2) energy savings from efficient task collocation to minimize unnecessary movements. RHTP can be framed as a stochastic renewal process. According to the Renewal Reward Theorem, the expected energy usage per task cycle determines the long-run average. Traditional task and motion planning focuses on feasibility rather than optimality due to the unknown object and obstacle positions prior to execution. However, the known target/obstacle distribution in precision agriculture allows minimizing the expected energy usage. For each instance in this renewal process, we first compute a task-space partition, a novel data structure that captures all possibilities of task multiplexing and their probabilities together with robot reachability. Then we propose a region-based set-coverage problem to formulate the RHTP as a mixed-integer nonlinear program. We implemented and solved RHTP using a Branch-and-Bound solver. Compared to a baseline in simulations based on real field data, the results suggest a significant improvement in path length, number of robot stops, overall energy usage, and number of replans.
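As a quick illustration of the Renewal Reward Theorem invoked above, a Monte-Carlo check with hypothetical per-cycle energy and duration distributions; the numbers are arbitrary, not field data.

```python
import numpy as np

def renewal_reward_check(n_cycles=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical per-cycle energy and duration distributions (arbitrary numbers).
    energy = rng.uniform(2.0, 6.0, n_cycles)     # Joules spent per weed-removal cycle
    duration = rng.uniform(5.0, 15.0, n_cycles)  # seconds per cycle
    # Long-run average energy usage per unit time over many cycles ...
    long_run_rate = energy.sum() / duration.sum()
    # ... converges to E[energy per cycle] / E[cycle length] (Renewal Reward Theorem).
    theorem_rate = energy.mean() / duration.mean()
    return long_run_rate, theorem_rate
```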
|
|
10:10-10:15, Paper WeBT20.4 | |
Leveraging LLMs for Mission Planning in Precision Agriculture |
|
Zuzuarregui, Marcos | University of California, Merced |
Carpin, Stefano | University of California, Merced |
Keywords: Software Tools for Robot Programming, Robotics and Automation in Agriculture and Forestry, Agricultural Automation
Abstract: Robotics and artificial intelligence hold significant potential for advancing precision agriculture. While robotic systems have been successfully deployed for various tasks, adapting them to perform diverse missions remains challenging, particularly because end users often lack technical expertise. In this paper, we present an end-to-end system that leverages large language models (LLMs), specifically ChatGPT, to enable users to assign complex data collection tasks to autonomous robots using natural language instructions. To enhance reusability, mission plans are encoded using an existing IEEE task specification standard, and are executed on robots via ROS2 nodes that bridge high-level mission descriptions with existing ROS libraries. Through extensive experiments, we highlight the strengths and limitations of LLMs in this context, particularly regarding spatial reasoning and solving complex routing challenges, and show how our proposed implementation overcomes them.
|
|
10:15-10:20, Paper WeBT20.5 | |
Hierarchical Tri-Manual Planning for Vision-Assisted Fruit Harvesting with Quadrupedal Robots |
|
Liu, Zhichao | University of California, Riverside |
Zhou, Jingzong | University of California, Riverside |
Karydis, Konstantinos | University of California, Riverside |
Keywords: Robotics and Automation in Agriculture and Forestry, Field Robots, Bimanual Manipulation
Abstract: This paper addresses the challenge of developing a multi-arm quadrupedal robot capable of efficiently harvesting fruit in complex, natural environments. To overcome the inherent limitations of traditional bimanual manipulation, we introduce LocoHarv-3, the first three-arm quadrupedal robot, which builds on top of the Spot quadruped, and propose a novel hierarchical tri-manual planning approach for automated fruit harvesting with collision-free trajectories between the built-in end-effector of Spot and our custom-made bimanual manipulator. Our comprehensive semi-autonomous framework integrates teleoperation, supported by LiDAR-based odometry and mapping, with learning-based visual perception for accurate fruit detection and pose estimation. Validation is conducted through a series of controlled indoor experiments using motion capture and extensive field tests in natural settings. Results demonstrate a 90% success rate in in-lab settings with a single attempt, and field trials further verify the system's robustness and efficiency in more challenging real-world environments.
|
|
10:20-10:25, Paper WeBT20.6 | |
Capacitated Agriculture Fleet Vehicle Routing with Implements and Limited Autonomy: A Model and a Two-Phase Solution Approach |
|
Lopez-Sanchez, Aitor | Universidad Rey Juan Carlos |
Lujak, Marin | University Rey Juan Carlos |
Semet, Frederic | Centrale Lille |
Billhardt, Holger | Universidad Rey Juan Carlos |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Agent-Based Systems, Robotics and Automation in Agriculture and Forestry
Abstract: In this paper, we study the vehicle routing problem (VRP) for a fleet of cooperative autonomous agricultural robots (agribots) equipped with detachable implements, with the goal of efficiently and sustainably completing agricultural tasks in precision crop farming. The state of the art in agribot fleet routing with detachable implements is lacking. Consequently, we propose the Capacitated Agriculture Fleet Vehicle Routing Problem with Implements and Limited Autonomy (CAFVRPILA), designed to optimize the agribot fleet's routes across a set of given agricultural tasks while considering implement capacities, agribot-implement compatibilities, and the agribots' limited battery autonomy. A heuristic two-phase decomposition approach is proposed for this problem. Simulation experiments show that minimizing travel distances and costs with CAFVRPILA enhances sustainable farming while maximizing productivity and resource use. The results also demonstrate that synchronizing multiple operations improves efficiency, particularly in larger fleets.
|
|
10:25-10:30, Paper WeBT20.7 | |
Towards Closing the Loop in Robotic Pollination for Indoor Farming Via Autonomous Microscopic Inspection |
|
Kong, Chuizheng | Georgia Institute of Technology |
Qiu, Alex | Georgia Institute of Technology |
Wibowo, Idris | Georgia Institute of Technology |
Ren, Marvin | Georgia Institute of Technology |
Dhori, Aishik | Georgia Institute of Technology |
Ling, Kai-Shu | United States Department of Agriculture - Agricultural Research |
Hu, Ai-Ping | Georgia Tech Research Institute |
Kousik, Shreyas | Georgia Institute of Technology |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Automation at Micro-Nano Scales
Abstract: Effective pollination is a key challenge for indoor farming, since bees struggle to navigate without the sun. While a variety of robotic system solutions have been proposed, it remains difficult to autonomously check that a flower has been sufficiently pollinated to produce high-quality fruit, which is especially critical for self-pollinating crops such as strawberries. To this end, this work proposes a novel robotic system for indoor farming. The proposed hardware combines a 7-degree-of-freedom (DOF) manipulator arm with a custom end-effector, comprised of an endoscope camera, a 2-DOF microscope subsystem, and a custom vibrating pollination tool; this is paired with algorithms to detect and estimate the pose of strawberry flowers, navigate to each flower, pollinate using the tool, and inspect with the microscope. The key novelty is vibrating the flower from below while simultaneously inspecting with a microscope from above. Each subsystem is validated via extensive experiments.
|
|
WeBT21 |
410 |
Optimization and Optimal Control |
Regular Session |
Co-Chair: Mastalli, Carlos | Heriot-Watt University |
|
09:55-10:00, Paper WeBT21.1 | |
Embedded Robust Model Predictive Path Integral Control Using Sensitivity Tubes and GPU Acceleration |
|
Falk Nyboe, Frederik | University of Southern Denmark |
Afifi, Amr | University of Twente |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Ebeid, Emad | University of Southern Denmark |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Keywords: Optimization and Optimal Control, Aerial Systems: Mechanics and Control, Embedded Systems for Robotic and Automation
Abstract: This paper proposes a method to robustify model predictive path integral (MPPI) control by directly taking into account the effects of parameter uncertainty into the controller formulation. Leveraging the recent notion of closed-loop state sensitivity, the proposed MPPI can consider the state sensitivity against parameter mismatch as a part of the system state, and consequently exploit this additional information to address the challenge of model mismatch in sampling-based model predictive control. Using an obstacle avoidance scenario, we demonstrate the use of our approach to control an aerial robot. We present an embedded implementation of our method, utilizing parallelization of computations on a GPU. Finally, we show the increased robustness of our approach over a standard MPPI controller through hardware-in-the-loop simulations and validate its embedded real-time properties.
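For context, a minimal sketch of a vanilla (non-robustified) MPPI update; the dynamics and cost callables are placeholders, and the paper's sensitivity-tube augmentation and GPU parallelization are not shown.

```python
import numpy as np

def mppi_step(x0, U, dynamics, cost, n_samples=256, sigma=0.3, lam=1.0, rng=None):
    """One MPPI update of the nominal control sequence U of shape (horizon, m).

    Samples perturbed control sequences, rolls them out through the dynamics,
    and re-weights the perturbations with a softmax over trajectory costs.
    """
    rng = np.random.default_rng() if rng is None else rng
    horizon, m = U.shape
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon, m))
    costs = np.zeros(n_samples)
    for k in range(n_samples):
        x = x0
        for t in range(horizon):
            u = U[t] + noise[k, t]
            costs[k] += cost(x, u)
            x = dynamics(x, u)
    # Softmax weights over sampled trajectories (lower cost -> higher weight).
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    # Weighted average of the sampled perturbations updates the nominal plan.
    return U + np.tensordot(w, noise, axes=1)
```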
|
|
10:00-10:05, Paper WeBT21.2 | |
Guided Bayesian Optimization: Data-Efficient Controller Tuning with Digital Twin (I) |
|
Nobar, Mahdi | ETH Zurich |
Keller, Jürg | FHNW |
Rupenyan, Alisa | Zurich University of Applied Sciences |
Khosravi, Mohammad | TU Delft |
Lygeros, John | ETH Zurich |
Keywords: Optimization and Optimal Control, Calibration and Identification, Incremental Learning
Abstract: This article presents the guided Bayesian optimization (BO) algorithm as an efficient data-driven method for iteratively tuning closed-loop controller parameters using a digital twin of the system. The digital twin is built using closed-loop data acquired during standard BO iterations, and activated when the uncertainty in the Gaussian Process model of the optimization objective on the real system is high. We define a controller tuning framework independent of the controller or the plant structure. Our proposed methodology is model-free, making it suitable for nonlinear and unmodelled plants with measurement noise. The objective function consists of performance metrics modeled by Gaussian processes. We utilize the available information in the closed-loop system to progressively maintain a digital twin that guides the optimizer, improving the data efficiency of our method. Switching the digital twin on and off is triggered by our data-driven criteria related to the digital twin's uncertainty estimations in the BO tuning framework. Effectively, it replaces much of the exploration of the real system with exploration performed on the digital twin. We analyze the properties of our method in simulation and demonstrate its performance on two real closed-loop systems with different plant and controller structures. The experimental results show that our method requires fewer experiments on the physical plant than Bayesian optimization to find the optimal controller parameters.
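A minimal sketch of the switching idea, assuming a scikit-learn-style Gaussian-process surrogate and a simple uncertainty threshold; the paper's data-driven switching criterion is more elaborate than this placeholder rule.

```python
import numpy as np

def choose_evaluation_target(gp, theta_candidate, uncertainty_threshold):
    """Decide whether the next controller-parameter evaluation should run on the
    real system or on the digital twin, based on GP predictive uncertainty.

    gp: fitted surrogate exposing predict(X, return_std=True), e.g.
        sklearn.gaussian_process.GaussianProcessRegressor.
    """
    _, std = gp.predict(np.atleast_2d(theta_candidate), return_std=True)
    # High uncertainty about the real-system objective -> explore on the twin first.
    return "digital_twin" if std[0] > uncertainty_threshold else "real_plant"
```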
|
|
10:05-10:10, Paper WeBT21.3 | |
Enhancing Robotic System Robustness Via Lyapunov Exponent-Based Optimization |
|
Fadini, Gabriele | ETHZ |
Coros, Stelian | ETH Zurich |
Keywords: Optimization and Optimal Control, Dynamics, Legged Robots
Abstract: We present a novel differentiable approach to quantifying and optimizing stability in robotic systems, addressing an open challenge in the field of robot analysis, control, design, and optimization. Our method leverages differentiable simulation over extended time horizons to estimate a robustness metric based on the Lyapunov exponents. The proposed metric offers several properties, including a natural extension to limit cycles (commonly encountered in robotics tasks and locomotion) and independence from the trajectory path for states converging to the attractor. We showcase, with an ad-hoc JAX gradient-based optimization framework, remarkable flexibility in tackling the robustness challenge. Our approach is tested through diverse scenarios of varying complexity, encompassing high-degree-of-freedom systems and contact-rich environments. The positive outcomes across these cases highlight the potential of our method in quantifying and possibly enhancing system robustness.
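For context, a minimal Benettin-style estimate of the largest Lyapunov exponent from a black-box simulator step; the paper's differentiable-simulation formulation and JAX-based optimization are not reproduced, and the step function, step size, and perturbation scale below are placeholders.

```python
import numpy as np

def largest_lyapunov_exponent(step, x0, n_steps=2000, dt=0.01, eps=1e-6, seed=0):
    """Benettin-style estimate: track a tiny perturbation, renormalize it each
    step, and average the log growth rates.

    step(x) -> next state after time dt (any black-box simulator step).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d0 = rng.normal(size=x.shape)
    x_pert = x + eps * d0 / np.linalg.norm(d0)
    log_growth = 0.0
    for _ in range(n_steps):
        x, x_pert = step(x), step(x_pert)
        d = np.linalg.norm(x_pert - x)
        log_growth += np.log(d / eps)
        # Renormalize the perturbation back to size eps along the current direction.
        x_pert = x + eps * (x_pert - x) / d
    return log_growth / (n_steps * dt)
```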
|
|
10:10-10:15, Paper WeBT21.4 | |
Endpoint-Explicit Differential Dynamic Programming Via Exact Resolution |
|
Parilli, Maria | Universidad Simón Bolívar |
Martinez, Sergi | Heriot-Watt |
Mastalli, Carlos | Heriot-Watt University |
Keywords: Optimization and Optimal Control, Multi-Contact Whole-Body Motion Planning and Control, Formal Methods in Robotics and Automation
Abstract: We introduce a novel method for handling endpoint constraints in constrained differential dynamic programming (DDP). Unlike existing approaches, our method guarantees quadratic convergence and is exact, effectively managing rank deficiencies in both endpoint and stagewise equality constraints. It is applicable to both forward and inverse dynamics formulations, making it particularly well-suited for model predictive control (MPC) applications and for accelerating optimal control (OC) solvers. We demonstrate the efficacy of our approach across a broad range of robotics problems and provide a user-friendly open-source implementation within CROCODDYL.
|
|
10:15-10:20, Paper WeBT21.5 | |
Second-Order Stein Variational Dynamic Optimization |
|
Aoyama, Yuichiro | Georgia Institute of Technology |
Lehmann, Peter | Georgia Institute of Technology |
Theodorou, Evangelos | Georgia Institute of Technology |
Keywords: Optimization and Optimal Control, Constrained Motion Planning, Motion and Path Planning
Abstract: We present a novel second-order trajectory optimization algorithm based on Stein Variational Newton's Method and Maximum Entropy Differential Dynamic Programming. The proposed algorithm, called Stein Variational Differential Dynamic Programming, is a kernel-based extension of Maximum Entropy Differential Dynamic Programming that combines the best of the two worlds of sampling-based and gradient-based optimization. The resulting algorithm avoids known drawbacks of gradient-based dynamic optimization in terms of getting stuck at local minima, while it overcomes limitations of sampling-based stochastic optimization in terms of introducing undesirable stochasticity when applied in online fashion. To test the efficacy of the proposed algorithm, experiments are conducted in Model Predictive Control mode. The experiments include comparisons with unimodal and multimodal Maximum Entropy Differential Dynamic Programming as well as Model Predictive Path Integral Control and its multimodal and Stein Variational extensions. The results demonstrate the superior performance of the proposed algorithms and confirm the hypothesis that there is a middle ground between sampling- and gradient-based optimization that is indeed beneficial for dynamic optimization.
|
|
10:20-10:25, Paper WeBT21.6 | |
Application of Koopman Direct Encoding-Based Model Predictive Control to Nonlinear Electromechanical Systems |
|
Park, Sungbin | Korea Advanced Institute of Science and Technology |
Kim, Won Dong | Korea Advanced Institute of Science & Technology (KAIST) |
Jeon, Sangha | Korea Advanced Institute of Science and Technology(KAIST) |
Kim, Jung | KAIST |
Keywords: Optimization and Optimal Control, Dynamics, Contact Modeling
Abstract: The Koopman operator framework has shown promising results in enabling the analysis of nonlinear dynamics into an infinite-dimensional linear representation. Koopman direct encoding (KDE) is a model-based approach that utilizes inner products and compositions in a Hilbert space to compute the Koopman operator. However, it has primarily been applied to autonomous systems and simulation environments. Here, we extend the application of KDE to nonautonomous systems and real-world environments by introducing Koopman direct encoding-based model predictive control (KDE-MPC). It was validated on nonlinear electromechanical systems with segmented dynamic conditions, such as contact-noncontact transitions, which pose challenges for modeling and control. Simulation results demonstrate a more stable and smoother position profile compared to proportional-integral-derivative control, particularly at discontinuous boundaries. KDE-MPC was also applied to real-world systems, achieving similar position tracking performance to simulation results. We anticipate that KDE-MPC will offer a viable solution for complex robotic control challenges.
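For contrast with the model-based direct encoding used above, a generic data-driven EDMD approximation of the Koopman operator (not the paper's KDE method, which computes the operator from inner products of basis functions); the lifting function is a placeholder.

```python
import numpy as np

def edmd_koopman(X, Y, lift):
    """Least-squares (EDMD) approximation of the Koopman operator from data.

    X, Y: (n_samples, state_dim) snapshot pairs with Y[i] the successor of X[i].
    lift: function mapping a batch of states to observables, (n, d) -> (n, n_obs).
    """
    Phi_x, Phi_y = lift(X), lift(Y)
    # Solve Phi_x @ K ≈ Phi_y in the least-squares sense.
    K, *_ = np.linalg.lstsq(Phi_x, Phi_y, rcond=None)
    return K

def rollout_lifted(K, lift, x0, steps):
    # Propagate the lifted state linearly with the identified Koopman matrix.
    z = lift(np.atleast_2d(np.asarray(x0, dtype=float)))
    traj = [z]
    for _ in range(steps):
        traj.append(traj[-1] @ K)
    return np.vstack(traj)
```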
|
|
10:25-10:30, Paper WeBT21.7 | |
Effective Search for Control Hierarchies within the Policy Decomposition Framework |
|
Khadke, Ashwin | The AI Institute |
Geyer, Hartmut | Carnegie Mellon University |
Keywords: Optimization and Optimal Control, Evolutionary Robotics, Reinforcement Learning
Abstract: Policy decomposition is a novel framework for approximating optimal control policies of complex dynamical systems with a hierarchy of policies derived from smaller but tractable subsystems. It stands out amongst the class of hierarchical control methods by estimating a priori how well the closed-loop behavior of different control hierarchies matches the optimal policy. However, the number of possible hierarchies grows prohibitively with the number of inputs and the dimension of the state-space of the system making it unrealistic to estimate the closed-loop performance for all hierarchies. Here, we present the development of two search methods based on Genetic Algorithm and Monte-Carlo Tree Search to tackle this combinatorial challenge, and demonstrate that it is indeed surmountable. We showcase the efficacy of our search methods and the generality of the framework by applying it towards finding hierarchies for control of three distinct robotic systems: a simplified biped, a planar manipulator, and a quadcopter. The discovered hierarchies, in comparison to heuristically designed ones, provide improved closed-loop performance or can be computed in minimal time with marginally worse control performance, and also exceed the control performance of policies obtained with popular deep reinforcement learning methods.
|
|
WeBT22 |
411 |
Learning Based Planning for Manipulation 2 |
Regular Session |
Chair: Choi, Changhyun | University of Minnesota, Twin Cities |
|
09:55-10:00, Paper WeBT22.1 | |
Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects |
|
Scheikl, Paul Maria | Johns Hopkins University |
Schreiber, Nicolas | Karlsruhe Institute of Technology (KIT) |
Haas, Christoph | Karlsruhe Institute of Technology (KIT) |
Freymuth, Niklas | Karlsruhe Institute of Technology |
Neumann, Gerhard | Karlsruhe Institute of Technology |
Lioutikov, Rudolf | Karlsruhe Institute of Technology |
Mathis-Ullrich, Franziska | Friedrich-Alexander-University Erlangen-Nurnberg (FAU) |
Keywords: Surgical Robotics: Laparoscopy, Imitation Learning
Abstract: Policy learning in robot-assisted surgery (RAS) lacks data-efficient and versatile methods that exhibit the desired motion quality for delicate surgical interventions. To this end, we introduce Movement Primitive Diffusion (MPD), a novel method for imitation learning (IL) in RAS that focuses on gentle manipulation of deformable objects. The approach combines the versatility of diffusion-based imitation learning (DIL) with the high-quality motion generation capabilities of Probabilistic Dynamic Movement Primitives (ProDMPs). This combination enables MPD to achieve gentle manipulation of deformable objects, while maintaining data efficiency critical for RAS applications where demonstration data is scarce. We evaluate MPD across various simulated and real world robotic tasks on both state and image observations. MPD outperforms state-of-the-art DIL methods in success rate, motion quality, and data efficiency.
|
|
10:00-10:05, Paper WeBT22.2 | |
Sim-Grasp: Learning 6-DOF Grasp Policies for Cluttered Environments Using a Synthetic Benchmark |
|
Li, Juncheng | Purdue University |
Cappelleri, David | Purdue University |
Keywords: Mobile Manipulation, Deep Learning in Grasping and Manipulation, Grasping
Abstract: In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp policies achieve grasping success rates of 97.14% for single objects and 87.43% and 83.33% for mixed clutter scenarios of Levels 1-2 and Levels 3-4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target picking, pushing the boundaries of intelligent robotic systems.
|
|
10:05-10:10, Paper WeBT22.3 | |
Controlled Robot Language with Frame Semantics (FrameCRL) for Autonomous Context-Aware High-Level Planning |
|
Tran, Dang | University of Alabama |
Yan, Fujian | Wichita State University |
Zhang, Qiang | The University of Alabama |
Zhang, Yinlong | Shenyang Institute of Automation, Chinese Academy of Sciences |
He, Hongsheng | The University of Alabama |
Keywords: AI-Based Methods, Human-Robot Collaboration, Dual Arm Manipulation
Abstract: This paper proposes a configurable and scalable framework based on Controlled Robot Language with Frame Semantics (FrameCRL) for plan generation. Given natural language instructions, FrameCRL constructs an equivalent formal semantic formulation in the form of discourse representation structures (DRS). Imperative verbs are extracted from the semantic structures as keys to anchor relevant semantic frames from FrameNet, and the selected semantic frames are used to construct goal statements in planning language. Non-imperative statements are further analyzed to generate object specifications and the initial state of the planning problem. These generated statements are then merged into a single planning script, which can be solved directly by the integrated planner. The performance of FrameCRL was evaluated on various natural language corpora and compared with large language model (LLM)-based methods in plan generation. The results demonstrated that FrameCRL outperforms these methods in generating high-quality plans and can handle large-context scenarios. FrameCRL was also tested on pick-and-place tasks with a dual-arm robot, where it demonstrated robust linguistic understanding.
|
|
10:10-10:15, Paper WeBT22.4 | |
Effective Tuning Strategies for Generalist Robot Manipulation Policies |
|
Zhang, Wenbo | University of Adelaide |
Li, Yang | Commonwealth Scientific and Industrial Research Organisation |
Qiao, Yanyuan | The University of Adelaide |
Huang, Siyuan | Shanghai Jiao Tong University |
Liu, Jiajun | CSIRO |
Dayoub, Feras | The University of Adelaide |
Ma, Xiao | Dyson |
Liu, Lingqiao | University of Adelaide |
Keywords: Deep Learning in Grasping and Manipulation, Transfer Learning
Abstract: Generalist robot manipulation policies (GMPs) have the potential to generalize across a wide range of tasks, environments, and devices. However, existing policies continue to struggle with out-of-distribution scenarios, given that action data remains notoriously hard to collect. While fine-tuning offers a practical way to quickly adapt a GMP to novel domains and tasks with limited samples, we observe that the performance of the resulting GMP differs significantly depending on the design choices of the fine-tuning strategy. In this work, we first conduct an in-depth empirical study to investigate the effect of key factors in GMP fine-tuning strategies, covering the action space, policy head, and the choice of tunable parameters, where over 2,500 rollouts are evaluated for a single configuration. We systematically discuss and summarize our findings and identify the key design choices, which we believe provide a practical guideline for GMP fine-tuning. We observe that in a low-data regime, with carefully chosen fine-tuning strategies, a GMP significantly outperforms state-of-the-art imitation learning algorithms. The results presented in this work establish a new baseline for future studies on fine-tuned GMPs.
|
|
10:15-10:20, Paper WeBT22.5 | |
RM-Planner: Integrating Reinforcement Learning with Whole-Body Model Predictive Control for Mobile Manipulation |
|
Zhuang, Zixuan | Sun Yat-Sen University |
Zheng, Le | Sun Yat-Sen University |
Li, Wanyue | The University of Hong Kong |
Liu, Renming | Sun Yat-Sen University |
Lu, Peng | The University of Hong Kong |
Cheng, Hui | Sun Yat-Sen University |
Keywords: Mobile Manipulation, AI-Enabled Robotics, Service Robotics
Abstract: Mobile manipulation is a crucial problem in various real-world applications. However, existing methods suffer from low training efficiency and sparse rewards, and require complex coordination strategies between the mobile base and the arm. In this paper, we propose RM-Planner, a planning method for mobile manipulation tasks in unknown complex environments. Adopting a two-layer hierarchical framework, we use a whole-body Model Predictive Control (MPC)-based low-level planner to track subgoals and generate aggressive but safe joint commands throughout the entire manipulation process, while a Reinforcement Learning (RL)-based high-level policy directly uses 3D point cloud representations of the environment to guide the robot toward optimal manipulation postures based on current observations and specific task objectives. We conduct extensive simulations and real-world experiments, in which RM-Planner significantly outperforms state-of-the-art methods. Our code will be released at https://github.com/SYSU-RoboticsLab/RM-Planner.git.
|
|
10:20-10:25, Paper WeBT22.6 | |
Routing Manipulation of Deformable Linear Object Using Reinforcement Learning and Diffusion Policy |
|
Li, Mingen | University of Minnesota Twin Cities |
Yu, Houjian | University of Minnesota, Twin Cities |
Choi, Changhyun | University of Minnesota, Twin Cities |
Keywords: Deep Learning in Grasping and Manipulation, Reinforcement Learning, Imitation Learning
Abstract: Tasks involving deformable linear objects (DLOs) are prevalent in daily life but pose significant challenges due to their infinite degrees of freedom and underactuated nature. Frequent contact between DLOs and surrounding objects with unknown physical parameters, such as friction, further complicates their manipulation. Performing tasks like routing ropes through a hole requires gentle yet robust manipulation, making it particularly challenging. Previous research has not adequately addressed general DLO manipulation tasks that involve intensive contact, especially in environments with rough surfaces. This paper presents a robust and delicate manipulation learning approach for the DLO routing task, leveraging reinforcement learning (RL) and diffusion policy. First, reinforcement learning agents are trained separately for rope insertion and pulling. During training, the agents are encouraged to minimize rope tension throughout task execution in environments with randomized friction to achieve delicate motion. Next, the rollouts from these agents are collected as expert demonstrations to train a diffusion policy. Our approach generates delicate motions to prevent the rope from being damaged or getting stuck on rough surfaces while remaining robust against environmental disturbances. Please refer to our project page: https://lmeee.github.io/DLOPull/
|
|
10:25-10:30, Paper WeBT22.7 | |
TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image |
|
Wang, Haoxiao | Tianjin University of Technology |
Zhou, Kaichen | University of Oxford |
Gu, Binrui | Peking University |
Feng, ZhiYuan | Tsinghua University |
Wang, Weijie | Zhejiang University |
Sun, Peilin | Zhejiang University |
Xiao, Yicheng | Southeast University |
Zhang, Jianhua | Tianjin University of Technology |
Dong, Hao | Peking University |
Keywords: AI-Based Methods, Grippers and Other End-Effectors
Abstract: Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we, for the first time, propose a single-view RGB-D-based depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop scenarios. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine the depth estimation step by step. Finally, we use an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines on both synthetic and real-world benchmarks with acceptable inference time. A demo of our method can be found at: https://transdiff.github.io/
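For readers unfamiliar with how a conditional denoising loop iteratively refines a depth map, the sketch below shows a generic DDPM-style reverse process in NumPy. It is an editorial illustration only: the random-weight stub stands in for the paper's noise predictor conditioned on RGB-derived features, and the schedule and shapes are arbitrary.

```python
import numpy as np

# Generic DDPM-style reverse (denoising) loop over a depth map. The real
# system conditions its noise predictor on RGB-derived features
# (segmentation, edges, normals); here a simple stub stands in.
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
H = W = 32

def predict_noise(depth, t, condition):
    """Stub for the conditional noise-prediction network (not a trained model)."""
    return 0.1 * condition + 0.01 * depth

condition = rng.standard_normal((H, W))      # e.g. fused RGB features
depth = rng.standard_normal((H, W))          # start from pure noise

for t in reversed(range(T)):
    eps = predict_noise(depth, t, condition)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (depth - coef * eps) / np.sqrt(alphas[t])
    noise = rng.standard_normal((H, W)) if t > 0 else 0.0
    depth = mean + np.sqrt(betas[t]) * noise   # one reverse diffusion step

print("denoised depth stats:", depth.mean(), depth.std())
```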
|
|
WeBT23 |
412 |
Autonomous Vehicle Perception 4 |
Regular Session |
Chair: Valada, Abhinav | University of Freiburg |
Co-Chair: Ding, Wenchao | Fudan University |
|
09:55-10:00, Paper WeBT23.1 | |
Efficient Submap-Based Autonomous MAV Exploration Using Visual-Inertial SLAM Configurable for LiDARs or Depth Cameras |
|
Papatheodorou, Sotiris | Imperial College London |
Boche, Simon | Technical University of Munich |
Barbas Laina, Sebastián | TU Munich |
Leutenegger, Stefan | Technical University of Munich |
Keywords: Aerial Systems: Perception and Autonomy, Reactive and Sensor-Based Planning
Abstract: Autonomous exploration of unknown space is an essential component for the deployment of mobile robots in the real world. Safe navigation is crucial for all robotics applications and requires accurate and consistent maps of the robot's surroundings. To achieve full autonomy and allow deployment in a wide variety of environments, the robot must rely on on-board state estimation which is prone to drift over time. We propose a Micro Aerial Vehicle (MAV) exploration framework based on local submaps to allow retaining global consistency by applying loop-closure corrections to the relative submap poses. To enable large-scale exploration we efficiently compute global, environment-wide frontiers from the local submap frontiers and use a sampling-based next-best-view exploration planner. Our method seamlessly supports using either a LiDAR sensor or a depth camera, making it suitable for different kinds of MAV platforms. We perform comparative evaluations in simulation against a state-of-the-art submap-based exploration framework to showcase the efficiency and reconstruction quality of our approach. Finally, we demonstrate the applicability of our method to real-world MAVs, one equipped with a LiDAR and the other with a depth camera.
|
|
10:00-10:05, Paper WeBT23.2 | |
Parking-SG: Open-Vocabulary Hierarchical 3D Scene Graph Representation for Open Parking Environments |
|
Zhang, Yaowen | Beijing Institute of Technology |
Ruan, Yi | Beijing Institute of Technology |
Pan, Miaoxin | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: Automation Technologies for Smart Cities, Mapping, Semantic Scene Understanding
Abstract: Automatic Valet Parking (AVP) has garnered significant attention from industry and academia due to its potential to enhance traffic efficiency, parking safety, and user experience. While AVP technologies have been successfully applied in standard parking scenarios with clear markings, real-world parking environments are far more diverse and complex, posing challenges for current systems. To address these limitations, we present Parking-SG, an open-vocabulary hierarchical 3D scene graph representation, facilitating the application of AVP in open and complex environments. Our approach builds an object-based, open-vocabulary map that integrates both ground-level and ground-above objects for comprehensive environmental understanding. Leveraging common sense reasoning and object behavior relationships, various standard or non-standard parking spaces are inferred in open environments. Additionally, we extract and analyze path topology to construct a hierarchical map representation, supporting complex AVP tasks. Parking-SG is validated in both simulated and real-world environments, demonstrating its ability to generate rich environmental representations, accurately and flexibly infer parking spaces, and effectively perform complex AVP tasks.
|
|
10:05-10:10, Paper WeBT23.3 | |
3D Lane Detection Based on Projection-Consistent Reference Points and Intra & Inter-Lane Context |
|
Bing, Yiqiu | Capital Normal University |
Niu, Huilin | Capital Normal University |
Zhang, Hong | Sensetime |
Jiang, Na | Capital Normal University |
Zhou, Zhong | BeiHang University |
Geng, Qichuan | Capital Normal University |
Keywords: Computer Vision for Transportation, Deep Learning for Visual Perception
Abstract: 3D lane detection aims to identify lane categories and trends in 3D space, which is a vital and challenging task in autonomous driving. Existing methods introduce various priors to guide 3D lane prediction, which generally consist of a series of reference points for context aggregation. However, due to the misalignment between these reference points and the lanes, it is difficult to obtain complete and discriminative context for complex instances. In this paper, we introduce 3D priors that adapt to lane appearances and serve as references to aggregate the lane context. Specifically, we propose a projection-consistent reference generation strategy to keep the projected 3D reference points geometrically consistent with the corresponding lanes in images. In addition, a segmentation-lifting denoising strategy is designed to improve the ability of the model to map the lane segmentation into 3D space. To leverage more lane-related information, we propose a decoupled lane-context aggregation module that considers the perspectives of individual geometries and integrated layout, namely intra-lane and inter-lane context. Extensive experiments on the OpenLane dataset show that our approach outperforms previous methods and achieves state-of-the-art performance. The code will be made publicly available.
|
|
10:10-10:15, Paper WeBT23.4 | |
Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model |
|
Zhang, Ludan | Nankai University |
Ding, Xiaokang | School of Electronic and Information Engineering, Beijing Univer |
Dai, Yuqi | Tsinghua University |
He, Lei | Tsinghua University |
Li, Keqiang | Tsinghua University |
Keywords: Computer Vision for Automation, Deep Learning Methods, Representation Learning
Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms reduces development efficiency and impedes the establishment of trust. To address this issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes a module's feature maps against the ground truth within a unified semantic representation space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in the process of feature map encoding and representation aligning, facilitated by our proposed two-stage Alignment AutoEncoder, which ensures the preservation of salient information and the consistency of feature structure. The metric for evaluating the training maturity of functional modules, the Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
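To make the Similarity Score idea concrete, here is a minimal editorial sketch: encode a module's feature map and a ground-truth-derived map into a shared space and compare the embeddings with cosine similarity. The two-layer random-weight encoder and pooling are placeholders for the paper's two-stage Alignment AutoEncoder.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(feature_map, W1, W2):
    """Toy encoder standing in for the Alignment AutoEncoder: pool, then project."""
    pooled = feature_map.mean(axis=(1, 2))          # global average pool -> (C,)
    return np.tanh(W2 @ np.tanh(W1 @ pooled))       # embed into a shared space

def similarity_score(z_module, z_gt):
    """Cosine similarity between module and ground-truth embeddings."""
    return float(z_module @ z_gt /
                 (np.linalg.norm(z_module) * np.linalg.norm(z_gt) + 1e-8))

C, H, W = 64, 16, 16
W1 = rng.standard_normal((32, C)) / np.sqrt(C)
W2 = rng.standard_normal((16, 32)) / np.sqrt(32)

module_features = rng.standard_normal((C, H, W))    # a BEV module's output
gt_features = module_features + 0.1 * rng.standard_normal((C, H, W))

score = similarity_score(encode(module_features, W1, W2),
                         encode(gt_features, W1, W2))
print(f"similarity score: {score:.3f}")   # close to 1.0 for well-aligned features
```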
|
|
10:15-10:20, Paper WeBT23.5 | |
Panoptic-Depth Forecasting |
|
Hurtado, Juana Valeria | University of Freiburg |
Mohan, Riya | Freiburg University |
Valada, Abhinav | University of Freiburg |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Visual Learning
Abstract: Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting future panoptic segmentation and depth maps from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of future frames in a coherent manner. Furthermore, we present two baselines and propose a novel architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of the proposed architecture across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de
|
|
10:20-10:25, Paper WeBT23.6 | |
Coarse-To-Fine Cross-Modality Generation for Enhancing Vehicle Re-Identification with High-Fidelity Synthetic Data |
|
Jin, Leyang | National University of Singapore |
Ji, Wei | Nanjing University |
Chua, Tatseng | National University of Singapore |
Zheng, Zhedong | University of Macau |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems
Abstract: Due to the critical issues of privacy and partial occlusion, license plate information is not always available in vehicle recognition systems. Consequently, researchers have increasingly turned towards vehicle re-identification (reID) techniques to bridge the gap between cross-view camera systems. Despite the growing interest, one major challenge persists: the scarcity of authentic, large-scale training datasets. To address this challenge, this paper introduces a coarse-to-fine generation pipeline designed to synthesize high-fidelity vehicle data, thereby facilitating subsequent vehicle representation learning. Specifically, the proposed approach consists of three stages: Prompt Processing, Diffusion Fine-tuning, and Semantic Filtering. First, we collect detailed prompts from vehicle websites and companies with fine-grained vehicle prototype attributes. Next, we leverage the prior knowledge of these automotive prototypes to fine-tune diffusion models. Finally, to ensure the quality of the synthesized data, we employ pre-trained vision-language models to filter out substandard images. Building upon the high-quality data generated by this pipeline, we validate the effectiveness using vanilla models. Extensive experimental evaluations demonstrate that our approach achieves competitive accuracy on public benchmarks such as VeRi-776, VehicleID and CityFlowV2, and is compatible with various model architectures.
|
|
10:25-10:30, Paper WeBT23.7 | |
HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes |
|
Wu, Ke | Fudan University |
Zhang, Kaizhao | Fudan University |
Zhang, Zhiwei | Fudan University |
Tie, Muer | Fudan University |
Yuan, Shanshuai | Fudan University |
Zhao, Jieru | Shanghai Jiao Tong University |
Gan, Zhongxue | Fudan University |
Ding, Wenchao | Fudan University |
Keywords: Mapping, RGB-D Perception, Sensor Fusion
Abstract: Online dense mapping of urban scenes forms a fundamental cornerstone for scene understanding and navigation of autonomous vehicles. Recent advancements in dense mapping methods are mainly based on NeRF, whose rendering speed is too slow to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering speed hundreds of times faster than NeRF, holds greater potential for online dense mapping. However, integrating 3DGS into a street-view dense mapping framework still faces two challenges: incomplete reconstruction due to the absence of geometric information beyond the LiDAR coverage area, and extensive computation for reconstruction in large urban scenes. To this end, we propose HGS-Mapping, an online dense mapping framework for unbounded large-scale scenes. To attain complete construction, our framework introduces a Hybrid Gaussian Representation, which models different parts of the entire scene using Gaussians with distinct properties. Furthermore, we employ a hybrid Gaussian initialization mechanism and an adaptive update method to achieve high-fidelity and rapid reconstruction. To the best of our knowledge, we are the first to integrate Gaussian representation into online dense mapping of urban scenes. Our approach achieves SOTA reconstruction accuracy while employing only 66% of the number of Gaussians, leading to 20% faster reconstruction speed.
|
|
WeCT2 |
301 |
Interactive Robot Learning |
Regular Session |
Chair: Losey, Dylan | Virginia Tech |
Co-Chair: Zhou, Bolei | University of California, Los Angeles |
|
11:15-11:20, Paper WeCT2.1 | |
Personalizing Interfaces to Humans with User-Friendly Priors |
|
Christie, Benjamin | Virginia Tech |
Nemlekar, Heramb | Virginia Tech |
Losey, Dylan | Virginia Tech |
Keywords: Human-Robot Collaboration, Probabilistic Inference, Virtual Reality and Interfaces
Abstract: Robots often need to convey information to human users. For example, robots can leverage visual, auditory, and haptic interfaces to display their intent or express their internal state. In some scenarios there are socially agreed upon conventions for what these signals mean: e.g., a red light indicates an autonomous car is slowing down. But as robots develop new capabilities and seek to convey more complex data, the meaning behind their signals is not always mutually understood: one user might think a flashing light indicates the autonomous car is an aggressive driver, while another user might think the same signal means the autonomous car is defensive. In this paper we enable robots to adapt their interfaces to the current user so that the human's personalized interpretation is aligned with the robot's meaning. We start with an information theoretic end-to-end approach, which automatically tunes the interface policy to optimize the correlation between human and robot. But to ensure that this learning policy is intuitive --- and to accelerate how quickly the interface adapts to the human --- we recognize that humans have priors over how interfaces should function. For instance, humans expect interface signals to be proportional and convex. Our approach biases the robot's interface towards these priors, resulting in signals that are adapted to the current user while still following social expectations. Our simulations and user study results across 15 participants suggest that these priors improve robot-to-human communication. See videos here: https://youtu.be/Re3OLg57hp8.
|
|
11:20-11:25, Paper WeCT2.2 | |
Personalization in Human-Robot Interaction through Preference-Based Action Representation Learning |
|
Wang, Ruiqi | Purdue University |
Zhao, Dezhong | Beijing University of Chemical Technology |
Suh, Dayoon | Purdue University |
Yuan, Ziqin | Purdue University |
Chen, Guohua | Beijing University of Chemical Technology |
Min, Byung-Cheol | Purdue University |
Keywords: Human-Centered Robotics, Representation Learning, Human Factors and Human-in-the-Loop
Abstract: Preference-based reinforcement learning (PbRL) has shown significant promise for personalization in human-robot interaction (HRI) by explicitly integrating human preferences into the robot learning process. However, existing practices often require training a personalized robot policy from scratch, resulting in inefficient use of human feedback. In this paper, we propose preference-based action representation learning (PbARL), an efficient fine-tuning method that decouples common task structure from preference by leveraging pre-trained robot policies. Instead of directly fine-tuning the pre-trained policy with human preference, PbARL uses it as a reference for an action representation learning task that maximizes the mutual information between the pre-trained source domain and the target user preference-aligned domain. This approach allows the robot to personalize its behaviors while preserving original task performance and eliminates the need for extensive prior information from the source domain, thereby enhancing efficiency and practicality in real-world HRI scenarios. Empirical results on the Assistive Gym benchmark and a real-world user study (N=8) demonstrate the benefits of our method compared to state-of-the-art approaches. Website at https://sites.google.com/view/pbarl.
|
|
11:25-11:30, Paper WeCT2.3 | |
Interface Matters: Comparing First and Third-Person Perspective Interfaces for Bi-Manual Robot Behavioural Cloning |
|
Luo, Haining | Imperial College London |
Chacon Quesada, Rodrigo | Imperial College London |
Casado, Fernando E. | Imperial College London |
Lingg, Nico | Imperial College London |
Demiris, Yiannis | Imperial College London |
Keywords: Virtual Reality and Interfaces, Bimanual Manipulation, Learning from Demonstration
Abstract: Despite the growing interest in Behavioural Cloning for robots, little existing research has explicitly explored the impact of user interfaces on the effectiveness of expert demonstrations. We investigate the importance of user interface design in Behavioural Cloning, highlighting the critical role that interfaces play in conveying human demonstrations and robot capabilities. This study compares the effectiveness of first- and third-person perspective interfaces for robot shoe-lacing, a highly dexterous, bi-manual manipulation task that involves deformable objects and requires high precision. Our study highlights the importance of considering the impact of interface design on expert demonstration quality in Behavioural Cloning applications. By providing a first-person perspective, we observed significant differences in demonstration execution time and consistency compared to the third-person perspective. These findings suggest that the choice of interface can influence the quality of expert demonstrations, which in turn affects the performance of learning algorithms.
|
|
11:30-11:35, Paper WeCT2.4 | |
Robot Policy Transfer with Online Demonstrations: An Active Reinforcement Learning Approach |
|
Hou, Muhan | Vrije University Amsterdam |
Hindriks, Koen | Vrije Universiteit Amsterdam |
Eiben, A.E. | VU Amsterdam |
Baraka, Kim | Vrije Universiteit Amsterdam |
Keywords: Human Factors and Human-in-the-Loop, Learning from Demonstration, Transfer Learning
Abstract: Transfer Learning (TL) is a powerful tool that enables robots to transfer learned policies across different environments, tasks, or embodiments. To further facilitate this process, efforts have been made to combine it with Learning from Demonstrations (LfD) for more flexible and efficient policy transfer. However, these approaches are almost exclusively limited to offline demonstrations collected before policy transfer starts, which may suffer from the intrinsic issue of covariate shift introduced by LfD and harm the performance of policy transfer. Meanwhile, extensive work in the learning-from-scratch setting has shown that online demonstrations can effectively alleviate covariate shift and lead to better policy performance with improved sample efficiency. This work combines these insights to introduce online demonstrations into a policy transfer setting. We present Policy Transfer with Online Demonstrations, an active LfD algorithm for policy transfer that can optimize the timing and content of queries for online episodic expert demonstrations under a limited demonstration budget. We evaluate our method in eight robotic scenarios, involving policy transfer across diverse environment characteristics, task objectives, and robotic embodiments, with the aim of transferring a trained policy from a source task to a related but different target task. The results show that our method significantly outperforms all baselines in terms of average success rate and sample efficiency, compared to two canonical LfD methods with offline demonstrations and one active LfD method with online demonstrations. Additionally, we conduct preliminary sim-to-real tests of the transferred policy on three transfer scenarios in a real-world environment, demonstrating the policy's effectiveness on a real robot manipulator.
|
|
11:35-11:40, Paper WeCT2.5 | |
User-Aware Collaborative Learning in Human-Robot Interactions |
|
Gucsi, Bálint | University of Southampton |
Tuyen, Nguyen Tan Viet | University of Southampton |
Chu, Bing | University of Southampton |
Tarapore, Danesh | University of Southampton |
Tran-Thanh, Long | University of Warwick |
Keywords: Social HRI, Human-Robot Teaming, Learning from Experience
Abstract: Our work investigates how social robots can efficiently collaborate with human users in a user-aware manner, minimising the frustration generated in human colleagues and thus enhancing their experience. As part of this, we develop a user-aware framework for human-robot collaborative learning. We model users' frustration during human-robot interactions based on recent interactions, inspired by psychological principles, and develop different frustration-aware interactive preference learning and decision-making models using multi-armed bandit and knapsack methods. Evaluating our approach, 1) we conducted simulated experiments on realistic human-behaviour datasets and 2) a user study in which participants worked with a TIAGo Steel humanoid robot on a collaboration task using frustration-aware and non-frustration-aware (Upper Confidence Bound and instruction-based) models. We demonstrate that when collaborating with the frustration-aware robot, users completed the collaboration task 9.04% faster and with 20.54% fewer verbal interactions, with user questionnaire responses reporting less frustration experienced compared to the baseline approaches. Additionally, we create a multimodal dataset containing over 6 hours of human-robot interactions displaying various explicit and implicit user responses.
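The abstract mentions multi-armed bandit decision making combined with a frustration model. A minimal editorial sketch follows, assuming a simple additive frustration penalty on the UCB index; the penalty form, constants, and simulated user are illustrative, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(2)

n_arms = 3                      # candidate robot actions
true_reward = np.array([0.4, 0.6, 0.5])   # unknown success probabilities
frustration = np.zeros(n_arms)  # running estimate of frustration each action causes

counts = np.zeros(n_arms)
values = np.zeros(n_arms)
beta = 0.5                      # weight of the frustration penalty (illustrative)

for t in range(1, 201):
    ucb = values + np.sqrt(2 * np.log(t) / np.maximum(counts, 1e-9))
    ucb[counts == 0] = np.inf                   # try every arm at least once
    arm = int(np.argmax(ucb - beta * frustration))

    reward = float(rng.random() < true_reward[arm])
    caused_frustration = float(rng.random() < 0.3 * (1 - true_reward[arm]))

    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
    frustration[arm] += (caused_frustration - frustration[arm]) / counts[arm]

print("selection counts per action:", counts)
```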
|
|
11:40-11:45, Paper WeCT2.6 | |
Data-Efficient Learning from Human Interventions for Mobile Robots |
|
Peng, Zhenghao | University of California, Los Angeles |
Liu, Zhizheng | SenseTime |
Zhou, Bolei | University of California, Los Angeles |
Keywords: Human Factors and Human-in-the-Loop, Reinforcement Learning, Learning from Demonstration
Abstract: Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to its robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets, carefully crafted reward functions, and face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method PVP4Real that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots---a legged quadruped, and a wheeled delivery robot---in two mobile robot tasks, one of which even uses raw RGBD image as observation. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at: https://metadriverse.github.io/pvp4real/
|
|
WeCT3 |
303 |
Mechanism Design 3 |
Regular Session |
Chair: Tadakuma, Kenjiro | Osaka University |
Co-Chair: Sibai, Hussein | Washington University in St. Louis |
|
11:15-11:20, Paper WeCT3.1 | |
A Morphing Quadrotor-Blimp with Balloon Failure Resilience for Mobile Ecological Sensing |
|
Sharma, Suryansh | Delft University of Technology |
Verhoeff, Mike | TU Delft |
Joosen, Floor Elisabeth | Delft University of Technology |
Venkatesha Prasad, RangaRao | Delft University of Technology |
Hamaza, Salua | TU Delft |
Keywords: Failure Detection and Recovery, Sensor Fusion, Aerial Systems: Mechanics and Control
Abstract: The increasing popularity of helium-assisted blimps for extended monitoring or data collection applications is hindered by a critical limitation -- single-point failure when the balloon malfunctions or bursts. To address this, we introduce Janus, a hybrid blimp-drone platform equipped with integrated balloon failure detection and recovery capability. Janus employs a triggered mechanism that seamlessly transitions the platform from a blimp to a standard quad-rotor drone. Utilizing multiple sensors and fusing their readings, we have developed a robust balloon failure detection system. Janus demonstrates omnidirectional mobility in blimp mode and transitions promptly into quadrotor mode upon receiving the signal. Our results affirm the successful recovery of the system from balloon failure, with a rapid response time of 66ms to balloon failure detection. The drone morphs into a quadrotor and achieves recovery within 0.362 seconds in 90% of cases. By amalgamating the enduring flight capabilities of blimps with the agility of quad-rotors within a morphing platform like Janus, we cater to applications demanding both prolonged flight duration and enhanced agility.
|
|
11:20-11:25, Paper WeCT3.2 | |
A Novel Passive Parallel Elastic Actuation Principle for Load Compensation in Legged Robots |
|
Zhang, Yifang | Istituto Italiano Di Tecnologia |
Jiang, Jingcheng | Istituto Italiano Di Tecnologia |
Tsagarakis, Nikos | Istituto Italiano Di Tecnologia |
Keywords: Mechanism Design, Actuation and Joint Mechanisms
Abstract: This work introduces a novel parallel elastic actuation principle designed to provide torque compensation for legged robots. Unlike existing solutions, the proposed concept leverages a nitrogen (N2) gas spring combined with a cam roller module to generate a highly customizable torque compensation profile for the target leg joint. An optimization-based design approach is employed to derive the specifications of the gas spring and optimize the cam module to produce a compensation torque profile closest to the desired one. The proposed load compensation concept and related mechanism are experimentally evaluated and practically integrated into the knee joint of a two-DoF monopedal robot actuated by cycloid actuators. The experimental results demonstrate that the proposed principle can effectively generate the required compensation torque profile and achieve significant benefits for the prototyped monopedal robot system, reducing the additional energy consumption caused by the payload by 71.92%. The entire system is compact, easy to integrate, and highly customizable, enabling the creation of nonlinear torque compensation profiles as needed. This work provides a promising solution to load compensation in legged robots.
|
|
11:25-11:30, Paper WeCT3.3 | |
Mathematical Modeling and Rolling Motion Generation of Planar Seven-Link Robot That Forms Passive Closed and Active Open Chains |
|
Asano, Fumihiko | Japan Advanced Institute of Science and Technology |
Sedoguchi, Taiki | Japan Advanced Institute of Science and Technology |
Tokuda, Isao T. | Ritsumeikan University |
Keywords: Underactuated Robots, Mechanism Design, Motion Control
Abstract: This paper investigates the mathematical modeling and basic motion properties of planar seven-link robots that form passive closed and active open chains. The passive closed model is formed by connecting seven rigid frames via seven viscoelastic joints, and the active open model is formed by connecting them via actuated joints. The former is a convex heptagonal model and can exhibit passive-dynamic rolling on a gentle downhill, whereas the latter virtually forms a forward-leaning octagonal shape by controlling the six relative joint angles. In the first half of this paper, we describe the model assumptions, develop the mathematical equations of motion and collision of the passive closed model, and numerically analyze the motion characteristics by changing the slope angle while checking the conditions necessary for stable motion generation. In the second half, we outline the active open model, develop the PD control system, and numerically analyze the motion characteristics by changing the target angle parameter that controls the degree of forward lean of the virtual octagon.
|
|
11:30-11:35, Paper WeCT3.4 | |
LEVA: A High-Mobility Logistic Vehicle with Legged Suspension |
|
Arnold, Marco | ETH Zürich |
Hildebrandt, Lukas | ETH Zürich |
Janssen, Kaspar | ETH Zürich |
Ongan, Efe | Ethz - Rsl |
Bürge, Pascal | ZHAW / Zurich University of Applied Sciences |
Gábriel, Ádám Gyula | ETH Zürich |
Kennedy, James | ETH Zürich |
Lolla, Rishi | ETH Zurich |
Oppliger, Quanisha | ZHAW Zurich University of Applied Sciences |
Schaaf, Micha | ZHAW Zurich University of Applied Sciences |
Church, Joseph | ETH RSL |
Fritsche, Michael Xaver | ETH Zurich |
Klemm, Victor | ETH Zurich |
Tuna, Turcan | ETH Zurich, Robotic Systems Lab |
Valsecchi, Giorgio | Robotic System Lab, ETH |
Weibel, Cedric | ETH Zuerich |
Hutter, Marco | ETH Zurich |
Wüthrich, Michael | ZHAW Zurich University of Applied Sciences |
Keywords: Field Robots, Mechanism Design, Legged Robots
Abstract: The autonomous transportation of materials over challenging terrain remains an unsolved problem with major economic implications. This paper introduces LEVA, a high-payload, high-mobility robot designed for autonomous logistics across varied terrains, including those typical in agriculture, construction, and search and rescue operations. LEVA uniquely integrates an advanced legged suspension system using parallel kinematics. It is capable of traversing stairs using a reinforcement learning (RL) controller, has steerable wheels, and includes a specialized box pickup mechanism that enables autonomous payload loading as well as precise and reliable cargo transportation of up to 85 kg across uneven surfaces, steps, and inclines, while maintaining a cost of transport (CoT) as low as 0.15. Through extensive experimental validation, LEVA demonstrates its off-road capabilities and reliability in payload loading and transport.
|
|
11:35-11:40, Paper WeCT3.5 | |
Safe Decentralized Multi-Agent Control Using Black-Box Predictors, Conformal Decision Policies, and Control Barrier Functions |
|
Huriot, Sacha | Washington University in St. Louis |
Sibai, Hussein | Washington University in St. Louis |
Keywords: Robust/Adaptive Control, Robot Safety, Machine Learning for Robot Control
Abstract: We address the challenge of safe control in decentralized multi-agent robotic settings, where agents use uncertain black-box models to predict other agents' trajectories. We use the recently proposed conformal decision theory to adapt the restrictiveness of control barrier function-based safety constraints based on observed prediction errors. We use these constraints to synthesize controllers that balance the objectives of safety and task accomplishment despite the prediction errors. We provide an upper bound on the average over time of the value of a monotonic function of the difference between the safety constraint based on the predicted trajectories and the constraint based on the ground-truth ones. We validate our theory through experimental results showing the performance of our controllers when navigating a robot in the multi-agent scenes of the Stanford Drone Dataset.
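A minimal editorial sketch of the conformal-style adaptation described above: a restrictiveness parameter is updated online from observed constraint violations and used to tighten a distance-based safety margin. The update rule, the braking rule, and all constants are simplified illustrations, not the authors' formulation.

```python
import numpy as np

rng = np.random.default_rng(3)

eta = 0.05       # adaptation rate of the conformal controller
epsilon = 0.1    # acceptable violation (miscoverage) rate
lam = 0.0        # restrictiveness: extra margin added to the safety constraint
d_safe = 1.0     # required true clearance to the other agent

violations = 0
n_steps = 500
for step in range(n_steps):
    predicted_gap = 1.2 + rng.normal(scale=0.05)   # black-box predicted clearance
    # Simplified control rule: if the tightened constraint would be violated,
    # the robot brakes, which buys extra clearance before the gap is realized.
    brake = predicted_gap < d_safe + max(lam, 0.0)
    true_gap = predicted_gap + (0.3 if brake else 0.0) + rng.normal(scale=0.2)

    violated = true_gap < d_safe
    violations += int(violated)
    # Conformal update: inflate lam after a violation, relax it slowly otherwise.
    lam += eta * (float(violated) - epsilon)

print(f"lam = {lam:.3f}, empirical violation rate = {violations / n_steps:.3f}")
```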
|
|
11:40-11:45, Paper WeCT3.6 | |
Poloidal Drive: Direct-Drive Transmission Mechanism for Active Omni-Wheels with Spoke Interference Avoidance |
|
Sano, Shunsuke | Osaka University |
Tadakuma, Kenjiro | Osaka University |
Kayawake, Ryotaro | Tohoku University |
Watanabe, Masahiro | Osaka University |
Abe, Kazuki | Osaka University |
Kemmotsu, Yuto | Tohoku University |
Tadokoro, Satoshi | Tohoku University |
Keywords: Mechanism Design, Wheeled Robots
Abstract: Wheels require extra space for steering. Omnidirectional wheels are ideal for confined spaces as they can move in all directions: forward/backward and left/right. Conventional omnidirectional wheels with passive rollers achieve this movement by combining multiple wheels. However, if even one wheel loses contact with the ground, the vehicle becomes inoperable. To overcome this limitation, omnidirectional wheels with actively driven rollers have been proposed. These designs, however, require additional components, which increase weight. This is because multi-step intermediate transmission mechanisms are needed to convert spindle rotation into roller rotation. Eliminating the intermediate transmission mechanism reduces the number of components and provides more space to enhance wheel strength. This study proposed a mechanism without intermediate transmission, clarified its design framework, and experimentally demonstrated its feasibility as an active omnidirectional wheel. The proposed design framework defines conditions to maximize both power transmission efficiency and strength. Experimental results showed that the transmission efficiency of the proposed mechanism is comparable to that of conventional mechanisms.
|
|
WeCT4 |
304 |
Sensor Fusion 2 |
Regular Session |
Co-Chair: Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) and Texas A&M University (TAMU) |
|
11:15-11:20, Paper WeCT4.1 | |
An End-To-End Learning-Based Multi-Sensor Fusion for Autonomous Vehicle Localization |
|
Lin, Changhong | DiDi Autonomous Driving |
Lin, Jiarong | The University of Hong Kong |
Sui, Zhiqiang | University of Michigan |
Qu, Xiaozhi | Didichuxing |
Wang, Rui | DiDi Autonomous Driving |
Sheng, Kehua | DIdi Inc |
Zhang, Bo | DIdi Inc |
Keywords: Localization, Sensor Fusion
Abstract: Multi-sensor fusion is essential for autonomous vehicle localization, as it is capable of integrating data from various sources for enhanced accuracy and reliability. The accuracy of the integrated location and orientation depends on the precision of the uncertainty modeling. Traditional methods of uncertainty modeling typically assume a Gaussian distribution and involve manual heuristic parameter tuning. However, these methods struggle to scale effectively and address long-tail scenarios. To address these challenges, we propose a learning-based method that encodes sensor information using higher-order neural network features, thereby eliminating the need for explicit uncertainty estimation. The method also removes the need for manual parameter fine-tuning by employing an end-to-end neural network specifically designed for multi-sensor fusion. In our experiments, we demonstrate the effectiveness of our approach in real-world autonomous driving scenarios. Results show that the proposed method outperforms existing multi-sensor fusion methods in terms of both accuracy and robustness. A video of the results can be viewed at https://youtu.be/q4iuobMbjME.
|
|
11:20-11:25, Paper WeCT4.2 | |
Unleashing HyDRa: Hybrid Fusion, Depth Consistency and Radar for Unified 3D Perception |
|
Wolters, Philipp | Technical University of Munich |
Gilg, Johannes | Technical University of Munich |
Teepe, Torben | Technical University of Munich |
Herzog, Fabian | Technical University of Munich |
Laouichi, Anouar | Technical University of Munich |
Hofmann, Martin | Fusionride Technology (Germany) GmbH |
Rigoll, Gerhard | Technische Universität München |
Keywords: Sensor Fusion, Semantic Scene Understanding, Object Detection, Segmentation and Categorization
Abstract: Low-cost, vision-centric 3D perception systems for autonomous driving have made significant progress in recent years, narrowing the gap to expensive LiDAR-based methods. The primary challenge in becoming a fully reliable alternative lies in robust depth prediction capabilities, as camera-based systems struggle with long detection ranges and adverse lighting and weather conditions. In this work, we introduce HyDRa, a novel camera-radar fusion architecture for diverse 3D perception tasks. Building upon the principles of dense Bird's-Eye-View (BEV)-based architectures, HyDRa introduces a hybrid fusion approach to combine the strengths of complementary camera and radar features in two distinct representation spaces. Our Height Association Transformer module leverages radar features already in the perspective view to produce more robust and accurate depth predictions. In the BEV, we refine the initial sparse representation by a Radar-weighted Depth Consistency. HyDRa achieves a new state-of-the-art for camera-radar fusion of 64.2 NDS (+1.8) and 58.4 AMOTA (+1.5) on the public nuScenes dataset. Moreover, our new semantically rich and spatially accurate BEV features can be directly converted into a powerful occupancy representation, beating all previous camera-based methods on the Occ3D benchmark by an impressive 3.7 mIoU. Code and models are available at https://github.com/phi-wol/hydra.
|
|
11:25-11:30, Paper WeCT4.3 | |
VIP-Dock: Vision, Inertia, and Pressure Sensor Fusion for Underwater Docking with Optical Beacon Guidance |
|
Zhang, Suohang | Zhejiang University |
Qian, Shipang | Zhejiang University |
Wang, Lu | Zhejiang University |
Fei, Xinyu | Zhejiang University |
Chen, Yanhu | Zhejiang University |
Keywords: Sensor Fusion, Marine Robotics, Sensor-based Control
Abstract: Underwater docking enhances the operational capabilities of Autonomous Underwater Vehicles (AUVs) by facilitating energy and data transfer. Optical beacons serve as the primary guidance method for AUVs to localize and track docking stations. This paper presents VIP-Dock, a novel optical beacon tracking algorithm for robust underwater docking of AUVs. VIP-Dock addresses the challenge of maintaining accurate beacon tracking under visual interference by integrating visual, inertial, and pressure perception. Employing an unscented Kalman filter framework, the VIP-Dock algorithm provides continuous optimal estimation of beacon positions. Experimental results demonstrated VIP-Dock's real-time tracking performance in actual docking scenarios and its ability to maintain accuracy during visual input failure. Implementation in a digital twin system for an underwater vertical shuttle showed significant improvement, increasing docking success rates from 62% to 84% across 100 trials under simulated current disturbances.
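The paper fuses visual, inertial, and pressure data in an unscented Kalman filter. As a simplified editorial stand-in, the sketch below fuses an intermittent, noisy visual beacon measurement with a frequent pressure-derived depth measurement in a plain linear Kalman filter; it conveys the predict/update structure without the sigma-point machinery, and all models and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(4)

# State: [x, z] beacon position relative to the AUV (planar toy example).
x = np.array([5.0, 2.0])
P = np.eye(2) * 1.0
F = np.eye(2)                         # beacon assumed static between steps
Q = np.eye(2) * 0.01                  # process noise (drift of the relative pose)

H_cam = np.eye(2)                     # camera observes both coordinates
R_cam = np.eye(2) * 0.25
H_prs = np.array([[0.0, 1.0]])        # pressure sensor constrains depth only
R_prs = np.array([[0.01]])

def kf_update(x, P, z, H, R):
    """Standard Kalman measurement update."""
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

for k in range(50):
    x, P = F @ x, F @ P @ F.T + Q                       # predict
    z_prs = np.array([2.0 + rng.normal(scale=0.05)])    # pressure-derived depth
    x, P = kf_update(x, P, z_prs, H_prs, R_prs)
    if k % 5 == 0:                                      # vision only sometimes valid
        z_cam = np.array([5.0, 2.0]) + rng.normal(scale=0.5, size=2)
        x, P = kf_update(x, P, z_cam, H_cam, R_cam)

print("fused beacon estimate:", np.round(x, 2))
```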
|
|
11:30-11:35, Paper WeCT4.4 | |
Heterogeneous Sensor Fusion and Active Perception for Transparent Object Reconstruction with a PDM^2 Sensor and a Camera |
|
Guo, Fengzhi | Texas A&M University |
Xie, Shuangyu | Texas A&M University |
Wang, Di | Texas A&M University |
Fang, Cheng | Texas A&M University |
Zou, Jun | Texas A&M University |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Keywords: Sensor Fusion, Perception for Grasping and Manipulation
Abstract: Transparent household objects present a challenge for domestic service robots, since neither regular cameras nor RGB-D cameras can provide accurate points for shape reconstruction. The new type of pretouch dual-modality distance and material sensor (PDM^2) can provide reliable and accurate depth readings, but it is a point sensor and scanning the object exclusively with the sensor is too inefficient. Hence, we present a sensor fusion approach by combining a regular camera with the PDM^2 sensor. The approach is based on a data fusion algorithm for shape reconstruction and an active perception algorithm for scan planning for the PDM^2 sensor. The data fusion algorithm is a distributed Gaussian process (GP)-based shape reconstruction method that allows for incremental local update to reduce computational time. The active perception algorithm is an optimization-based approach by increasing the information gain (IG) and prioritizing the boundary points under a preset travel distance constraint. We have implemented and tested the algorithms with six different transparent household items. The results show satisfactory shape reconstruction results in all test cases with an average increase in intersection over union (IoU) from 0.73 to 0.96.
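A minimal sketch of the two ingredients named in the abstract, using scikit-learn's GaussianProcessRegressor: fit a GP to sparse point measurements and pick the next scan location where the predictive standard deviation (a proxy for information gain) is largest. The 1-D profile, kernel choice, and absence of the travel-distance constraint are editorial simplifications of the paper's distributed GP and planner.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

def surface(x):
    """Hypothetical 1-D height profile of a transparent object."""
    return 0.3 * np.sin(3 * x) + 0.05 * x

# Sparse pretouch-style point measurements along the object.
X_train = rng.uniform(0, 2, size=(6, 1))
y_train = surface(X_train[:, 0]) + rng.normal(scale=0.005, size=6)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3) + WhiteKernel(1e-4),
                              normalize_y=True)
gp.fit(X_train, y_train)

# Candidate scan locations; predictive std acts as the information-gain proxy.
X_cand = np.linspace(0, 2, 200).reshape(-1, 1)
mean, std = gp.predict(X_cand, return_std=True)
next_scan = X_cand[np.argmax(std), 0]

print(f"next scan location with highest uncertainty: x = {next_scan:.2f}")
```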
|
|
11:35-11:40, Paper WeCT4.5 | |
DA-Fusion: Deformable Attention-Based RGB-D Fusion Transformer for Unseen Object Instance Segmentation |
|
Park, Yesol | Seoul National University |
Yoon, Hye Jung | Seoul National University |
Kim, Juno | Seoul National University |
Zhang, Byoung-Tak | Seoul National University |
Keywords: Logistics, Object Detection, Segmentation and Categorization, Deep Learning for Visual Perception
Abstract: In logistics automation, accurately segmenting unseen objects is essential for tasks such as bin picking, shelf picking, and warehouse sorting, which involve complex and cluttered environments. Traditional RGB-based methods tend to over-segment objects due to their reliance on texture, while depth-based methods often under-segment by focusing primarily on geometric features. To address these limitations, we propose DA-Fusion, a deformable attention-based RGB-D fusion Transformer designed for unseen object instance segmentation. DA-Fusion effectively combines the strengths of both RGB and depth data, enhancing segmentation accuracy in cluttered and multi-layered object environments. We also introduce the Object Clutter Bin Dataset (OCBD), a benchmark dataset specifically tailored for evaluating bin-picking scenarios in top-down views. Extensive evaluations demonstrate that DA-Fusion outperforms state-of-the-art methods across diverse environments, making it particularly suited for real-world logistics tasks.
|
|
11:40-11:45, Paper WeCT4.6 | |
PAIR360: A Paired Dataset of High-Resolution 360˚ Panoramic Images and LiDAR Scans |
|
Kim, Geunu | Kyung Hee University |
Kim, Daeho | Kyung Hee University |
Jang, Jaeyun | Kyung Hee University |
Hwang, Hyoseok | Kyung Hee University |
Keywords: Data Sets for SLAM, Sensor Fusion, Omnidirectional Vision
Abstract: The 360˚ camera is a compact omnidirectional perception system for capturing panoramic images with the same field of view as LiDAR. This boosts its versatility for use in autonomous driving and robotics. However, most existing datasets of 360˚ panoramic images primarily focus on indoor or virtual environments, or they offer only low-resolution outdoor images and LiDAR configurations. In this letter, we present PAIR360, a multi-modal dataset encompassing high-resolution 360˚ camera images and 3D LiDAR scans, aimed at stimulating research in computer vision. To this end, we collected a comprehensive dataset at Kyung Hee University Global Campus, capturing 52 sequences from 7 different areas under diverse atmospheric conditions, including sunny, cloudy, and sunrise. The dataset features 8K resolution panoramic imagery, six fisheye images, point clouds, GPS, and IMU data, all synchronized using LiDAR timestamps and calibrated across visual sensors. We also provide additional data, such as depth maps, segmentation, and 3D maps, to demonstrate the feasibility of our dataset and its application to various computer vision tasks. The dataset is available for download at: https://airlabkhu.github.io/PAIR-360-Dataset/
|
|
WeCT5 |
305 |
Aerial Manipulation 2 |
Regular Session |
Chair: Katzschmann, Robert Kevin | ETH Zurich |
Co-Chair: Panetsos, Fotis | New York University Abu Dhabi |
|
11:15-11:20, Paper WeCT5.1 | |
NDOB-Based Control of a UAV with Delta-Arm Considering Manipulator Dynamics |
|
Chen, Hongming | Sun Yat-Sen University |
Ye, Biyu | Sun Yat-Sen University |
Liang, Xianqi | Sun Yat-Sen University |
Deng, Weiliang | Sun Yat-Sen University |
Lyu, Ximin | Sun Yat-Sen University |
Keywords: Aerial Systems: Mechanics and Control
Abstract: Aerial Manipulators (AMs) provide a versatile platform for various applications, including 3D printing, architecture, and aerial grasping missions. However, their operational speed is often sacrificed to uphold precision. Existing control strategies for AMs often regard the manipulator as a disturbance and employ robust control methods to mitigate its influence. This research focuses on elevating the precision of the end effector and enhancing the agility of aerial manipulator movements. We present a composite control scheme to address these challenges. Initially, a Nonlinear Disturbance Observer (NDOB) is utilized to compensate for internal coupling effects and external disturbances. Subsequently, manipulator dynamics are processed through a high pass filter to facilitate agile movements. By integrating the proposed control method into a fully autonomous delta-arm-based AM system, we substantiate the controller's efficacy through extensive real-world experiments. The outcomes illustrate that the end-effector can achieve accuracy at the millimeter level.
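To illustrate the disturbance-observer idea at the heart of such a controller, here is a simplified first-order disturbance observer on a 1-D double integrator with an unknown constant force; the gains, dynamics, and filter structure are editorial illustrations and much simpler than the NDOB used on the full aerial manipulator.

```python
import numpy as np

# Simplified 1-D double integrator with an unknown constant disturbance d,
# compensated by a first-order disturbance observer with d_hat = z + L * vel.
m, dt, L = 1.5, 0.002, 20.0       # mass, time step, observer gain (illustrative)
d_true = 0.8                      # unknown disturbance force

pos, vel = 0.0, 0.0
z = 0.0                           # observer internal state

kp, kd = 25.0, 10.0               # PD tracking gains
pos_ref = 1.0

for k in range(5000):             # simulate 10 s
    d_hat = z + L * vel
    # PD tracking control with disturbance compensation.
    u = m * (kp * (pos_ref - pos) - kd * vel) - d_hat

    acc = (u + d_true) / m
    # Observer update for vel_dot = (u + d)/m with auxiliary term p(vel) = L*vel.
    z += dt * (-(L / m) * (z + L * vel) - (L / m) * u)
    vel += dt * acc
    pos += dt * vel

print(f"final position: {pos:.3f}, disturbance estimate: {z + L * vel:.3f}")
```

With the observer running, the estimate converges exponentially to the true disturbance, so the PD loop sees an almost disturbance-free plant.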
|
|
11:20-11:25, Paper WeCT5.2 | |
Flapping-Wing Flying Robot with Integrated Dual-Arm Scissors-Type Flora Sampling System |
|
Gordillo Durán, Rodrigo | Universidad De Sevilla |
Tapia, Raul | University of Seville |
Rafee Nekoo, Saeed | GRVC Robotics Lab, Universidad De Sevilla |
Martinez-de Dios, J.R. | University of Seville |
Ollero, Anibal | AICIA. G41099946 |
Keywords: Aerial Systems: Applications, Mechanism Design, Computer Vision for Automation
Abstract: Flapping-wing robotic birds, inspired by nature, offer an alternative means of generating thrust and lift to the conventional high-speed rotary propellers of unmanned aerial platforms. Recent advances in flapping technology have led to prototypes of leg-claw mechanisms for perching and, occasionally, very lightweight arms for sampling or aerial manipulation of tiny objects. A dual-arm manipulator on top of a robotic bird might not be bio-inspired or safe in case of a collision with the environment or during human-robot interaction. In this work, the previously designed dual-arm scissors-type manipulator has been improved in terms of workspace, mechanism, vision system, and blade placement to provide a more natural way of sampling. The new dual-arm system, weighing 100.2 g, is redesigned inside a beak to protect it against possible collisions and to secure the cutting blades within a protective shield. During flight, the dual-arm system is inside the cover and invisible; before manipulation, the lower beak opens and deploys the arm in a proper position for sampling. This new safety cover (beak), along with the new blade mechanism, enhances the cutting power and the safety of the operation. The experimental results show the successful cutting of a series of plant samples.
|
|
11:25-11:30, Paper WeCT5.3 | |
Reliable Aerial Manipulation: Combining Visual Tracking with Range Sensing for Robust Grasping |
|
Blöchlinger, Marc | ETHZ |
Toshimitsu, Yasunori | ETH Zurich |
Katzschmann, Robert Kevin | ETH Zurich |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Mobile Manipulation
Abstract: Reliable object localization is a critical challenge in drone-based aerial manipulation, particularly when objects are outside the camera's field of view. This paper presents a new approach to enhance drone reliability in aerial grasping tasks by integrating a 1D time-of-flight range sensor with a vision-based localization system. The range sensor, positioned beneath the drone, generates a detailed point cloud of the ground beneath the drone, allowing for precise object localization even when the drone hovers directly above the target. By combining visual tracking with real-time distance measurements, our system achieves a 96% grasp success rate across 128 trials with diverse objects, representing a significant improvement over previous approaches. This method enables zero-shot grasping without prior knowledge of the objects, increasing versatility and robustness in complex, unstructured environments. The open-source software and hardware design of the platform provides a foundation for further research and development in the field of autonomous aerial manipulation.
|
|
11:30-11:35, Paper WeCT5.4 | |
Safety-Critical Control for Aerial Physical Interaction in Uncertain Environment |
|
Byun, Jeonghyun | Seoul National University |
Kim, Yeonjoon | Seoul National University |
Lee, Dongjae | Seoul National University |
Kim, H. Jin | Seoul National University |
Keywords: Aerial Systems: Mechanics and Control, Robot Safety, Robust/Adaptive Control
Abstract: Aerial manipulation for safe physical interaction with their environments is gaining significant momentum in robotics research. In this paper, we present a disturbance-observer-based safety-critical control for a fully actuated aerial manipulator interacting with both static and dynamic structures. Our approach centers on a safety filter that dynamically adjusts the desired trajectory of the vehicle's pose, accounting for the aerial manipulator's dynamics, the disturbance observer's structure, and motor thrust limits. We provide rigorous proof that the proposed safety filter ensures the forward invariance of the safety set—representing motor thrust limits—even in the presence of disturbance estimation errors. To demonstrate the superiority of our method over existing control strategies for aerial physical interaction, we perform comparative experiments involving complex tasks, such as pushing against a static structure and pulling a plug firmly attached to an electric socket. Furthermore, to highlight its repeatability in scenarios with sudden dynamic changes, we perform repeated tests of pushing a movable cart and extracting a plug from a socket. These experiments confirm that our method not only outperforms existing methods but also excels in handling tasks with rapid dynamic variations.
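As a rough illustration of what respecting motor thrust limits inside a safety filter can look like, the snippet below solves a small quadratic program with cvxpy: stay as close as possible to a desired wrench while obeying per-motor thrust bounds through a fixed allocation matrix. The allocation matrix, bounds, and the absence of the disturbance-observer coupling and trajectory adjustment are simplifications and not the paper's formulation.

```python
import numpy as np
import cvxpy as cp

# Toy vehicle with 4 motors; a fixed allocation matrix maps per-motor thrusts
# to a 2-DoF wrench [Fz, My]. All numbers are illustrative only.
A = np.array([[1.0,  1.0, 1.0,  1.0],     # total vertical thrust Fz
              [0.2, -0.2, 0.2, -0.2]])    # pitching moment My from differential thrust
f_min, f_max = 0.0, 6.0                   # per-motor thrust limits [N]

# Desired wrench from a nominal tracking law; Fz = 26 N is not achievable with
# four 6 N motors, so the filter must return the closest feasible command.
w_des = np.array([26.0, 1.5])

f = cp.Variable(4)
objective = cp.Minimize(cp.sum_squares(A @ f - w_des))
constraints = [f >= f_min, f <= f_max]
cp.Problem(objective, constraints).solve()

print("filtered motor thrusts:", np.round(f.value, 2))
print("achieved wrench:       ", np.round(A @ f.value, 2))
```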
|
|
11:35-11:40, Paper WeCT5.5 | |
SPIBOT: A Drone-Tethered Mobile Gripper for Robust Aerial Object Retrieval in Dynamic Environments |
|
Kang, Gyuree | Korea Advanced Institute of Science and Technology (KAIST) |
Guenes, Ozan | Korea Advanced Institute of Science and Technology |
Lee, Seungwook | Korea Advanced Institute of Science and Technology |
Azhari, Maulana Bisyir | Korea Advanced Institute of Science and Technology |
Shim, David Hyunchul | KAIST |
Keywords: Aerial Systems: Applications, Field Robots, Marine Robotics
Abstract: In real-world field operations, aerial grasping systems face significant challenges in dynamic environments due to strong winds, shifting surfaces, and the need to handle heavy loads. Particularly when dealing with heavy objects, the powerful propellers of the drone can inadvertently blow the target object away as it approaches, making the task even more difficult. To address these challenges, we introduce SPIBOT, a novel drone-tethered mobile gripper system designed for robust and stable autonomous target retrieval. SPIBOT operates via a tether, much like a spider, allowing the drone to maintain a safe distance from the target. To ensure both stable mobility and secure grasping capabilities, SPIBOT is equipped with six legs and sensors to estimate the robot's and mission's states. It is designed with a reduced volume and weight compared to other hexapod robots, allowing it to be easily stowed under the drone and reeled in as needed. Designed for the 2024 MBZIRC Maritime Grand Challenge, SPIBOT is built to retrieve a 1kg target object in the highly dynamic conditions of the moving deck of a ship. This system integrates a real-time action selection algorithm that dynamically adjusts the robot's actions based on proximity to the mission goal and environmental conditions, enabling rapid and robust mission execution. Experimental results across various terrains, including a pontoon on a lake, a grass field, and rubber mats on coastal sand, demonstrate SPIBOT's ability to efficiently and reliably retrieve targets. SPIBOT swiftly converges on the target and completes its mission, even when dealing with irregular initial states and noisy information introduced by the drone.
|
|
11:40-11:45, Paper WeCT5.6 | |
GP-Based NMPC for Aerial Transportation of Suspended Loads |
|
Panetsos, Fotis | New York University Abu Dhabi |
Karras, George | University of Thessaly |
Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Keywords: Aerial Systems: Applications, Field Robots, Motion Control
Abstract: In this work, we leverage Gaussian Processes (GPs) and present a learning-based control scheme for the transportation of cable-suspended loads with multirotor Unmanned Aerial Vehicles (UAVs). Our ultimate goal is to approximate the model discrepancies that exist between the actual and nominal system dynamics. To this end, weighted and sparse Gaussian Process (GP) regression is used to approximate the model errors online, guaranteeing real-time performance while remaining adaptable to the conditions of the outdoor environment where the UAV is deployed. The learned model errors are fed into a nonlinear Model Predictive Controller (NMPC), formulated for the corrected system dynamics, which drives the UAV to reference positions while simultaneously minimizing the cable's angular motion, regardless of outdoor conditions and external disturbances, primarily stemming from unknown wind. The proposed scheme is validated through simulations and real-world experiments with an octorotor, demonstrating an 80% reduction in the steady-state position error under 4 Beaufort wind conditions compared to the nominal NMPC.
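For readers unfamiliar with the residual-learning idea in this abstract, the sketch below fits a GP to the discrepancy between measured and nominal accelerations and adds the GP prediction back into the model an NMPC would use. It is a minimal offline illustration, not the authors' weighted, sparse, online formulation; the nominal model and feature choice here are placeholders.

```python
# Minimal sketch (not the authors' implementation): learn an additive model-error
# term d(x, u) = a_measured - a_nominal(x, u) with a GP, then add its prediction
# back into the dynamics handed to an NMPC. `nominal_accel` is a placeholder for
# whatever nominal UAV-plus-load model is available.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def nominal_accel(v, u):
    # placeholder nominal dynamics: thrust along z minus gravity
    return np.array([0.0, 0.0, u[0] - 9.81])

# training data: flight features and the accelerations actually measured in flight
X_train = np.random.rand(200, 4)                          # e.g. [vx, vy, vz, thrust]
a_nom = np.stack([nominal_accel(x[:3], x[3:]) for x in X_train])
a_meas = a_nom + 0.3 * np.sin(X_train[:, :3])             # synthetic wind-like residual
residuals = a_meas - a_nom

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-3))
gp.fit(X_train, residuals)

def corrected_accel(v, u):
    """Dynamics used by the predictive controller: nominal model plus learned residual."""
    d = gp.predict(np.concatenate([v[:3], u])[None, :])[0]
    return nominal_accel(v, u) + d
```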
|
|
WeCT6 |
307 |
Vision-Based Navigation 3 |
Regular Session |
Co-Chair: Chang, Yan | Nvidia |
|
11:15-11:20, Paper WeCT6.1 | |
Knowledge-Driven Visual Target Navigation: Dual Graph Navigation |
|
Li, Shiyao | Dalian University of Technology |
Meng, Ziyang | Dalian University of Technology |
Pei, JianSong | Dalian University of Technology |
Chen, Jiahao | Institute of Automation, Chinese Academy of Sciences |
Dong, BingCheng | Dalian University of Technology |
Li, Guangsheng | Dalian University of Technology |
Liu, Shenglan | Dalian University of Technology |
Wang, Feilong | Dalian University of Technology |
Keywords: Vision-Based Navigation, Semantic Scene Understanding, Robotics in Under-Resourced Settings
Abstract: In unknown environments, navigating a robot by a given image to a specific location or instance is critical and challenging. The existing end-to-end approaches require simultaneous implicit learning of multiple subtasks, and modular approaches depend on metric information. Both approaches face high computational demands, often leading to difficulties in real-time updates and limited generalization, making them challenging to implement on resource-constrained devices. To address these challenges, we propose Dual Graph Navigation (DGN), a knowledge-driven, lightweight image instance navigation framework. DGN builds an External Knowledge Graph (EKG) from small-scale datasets to capture prior object correlations, efficiently guiding target exploration. During exploration, DGN builds an Internal Knowledge Graph (IKG) using an instance-aware module, which records explored objects based on reachability relationships rather than precise metric information. The IKG dynamically updates the EKG, enhancing the robot's adaptability to the current environment. Together, they realize topological perception and reduce computational overhead. Furthermore, unlike approaches characterized by over-dependence between components, DGN employs a plug-and-play modular design that allows independent training and flexible replacement of functional modules, effectively enhancing generalization performance while reducing training and deployment costs. Experiments illustrate that DGN generalizes well in different simulation environments (AI2-THOR, Habitat), achieving state-of-the-art performance on the ProcTHOR-10K dataset. It is compatible with three distinct real-world robot platforms, including edge computing devices without CUDA support. Its decision-making is 3.8 to 5.5 times faster than that of baseline methods. Further details can be found on the project page: https://dogplanningloyo.github.io/DGN/
|
|
11:20-11:25, Paper WeCT6.2 | |
Learning to Predict the Future from Monocular Vision for Efficient Human-Aware Navigation |
|
Huang, Yushuang | Institute of Computing Technology, Chinese Academy of Sciences |
Jiang, Hao | Institute of Computing Technology, Chinese Academy of Sciences |
Liu, Zihan | Institute of Computing Technology of the Chinese Academy of Scie |
Ouyang, Wanli | The University of Sydney |
Wang, Zhaoqi | Institute of Computing Technology, the Chinese Academy of Scienc |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, AI-Enabled Robotics
Abstract: Human-aware navigation (HAN) aims to build autonomous agents that robustly and naturally navigate in human-centered environments. Due to the complex and dynamic nature of this task, existing approaches typically rely on sophisticated pipelines that separately process perception and decision-making to solve it. In this work, we propose an Obstruction Distance Vector based End-to-End Model (ODVEEM), using monocular vision for navigation around humans. The Obstruction Distance Vector (ODV) is an intermediate representation in our model, leveraged to describe the Obstruction Distance to the first future collision in all possible directions in the horizontal field of view. As ODV cannot be calculated directly in the real world, we design a neural network for ODV estimation, formulating it as a classification problem with auxiliary proxy tasks, which play a key role in effectively predicting the implicit future motion of nearby humans. Taking advantage of ODV, ODVEEM supervised by human behavioral heuristics is employed to guide the agent to reach a goal efficiently and avoid potential collisions. Several challenging experiments show our method's substantial improvement over a number of baseline methods, attaining solid performance with zero-shot transfer to unseen simulated and real-world environments.
|
|
11:25-11:30, Paper WeCT6.3 | |
DP-Habitat: Bridging the Gap between Simulation and Reality for Visual Navigation in Dynamic Pedestrian Environments |
|
Qin, Liang | University of Science and Technology of China |
Wang, Min | Hefei Comprehensive National Science Center |
Wang, Haodong | University of Science and Technology of China |
Zhou, Wengang | University of Science and Technology of China |
Li, Houqiang | University of Science and Technology of China |
Keywords: Vision-Based Navigation
Abstract: Visual navigation in dynamic environments poses a considerable challenge, particularly in scenarios with diverse pedestrian behaviors. Traditional simulators primarily focus on static scenes, while existing dynamic pedestrian simulators often suffer from limitations such as monotonous pedestrian models, lack of interaction with the environment, and constrained scenarios. These deficiencies lead to notable discrepancies from real-world dynamic pedestrian environments. To bridge this gap, we introduce DP-Habitat, a dynamic pedestrian simulator developed on the Habitat platform. DP-Habitat efficiently simulates a wide range of complex and realistic human behaviors, with flexible interactions between pedestrian models and environments. It also supports rapid deployment of pedestrian models across various scenes, thereby more accurately replicating the complexities of real-world dynamic pedestrian settings. Additionally, we present Adaptive Object Navigation with Dynamic Mapping (AON-DM), a novel baseline method specifically designed for dynamic pedestrian settings. AON-DM integrates real-time pedestrian tracking and predictive modeling with a hybrid path planning strategy, markedly improving navigation efficiency and success rates. Our experimental results reveal that dynamic pedestrians significantly affect visual navigation performance within DP-Habitat, with AON-DM achieving superior effectiveness compared to existing methods under these challenging conditions. Furthermore, our approach maintains high performance in real-world scenarios, highlighting its practical applicability and robustness. The code and data are available at https://github.com/qinliangql/DP-Habitat.git
|
|
11:30-11:35, Paper WeCT6.4 | |
X-MOBILITY: End-To-End Generalizable Navigation Via World Modeling |
|
Liu, Wei | Nvidia |
Zhao, Huihua | Georgia Tech |
Li, Chenran | University of California, Berkeley |
Biswas, Joydeep | University of Texas at Austin |
Okal, Billy | University of Freiburg |
Goyal, Pulkit | Nvidia |
Chang, Yan | Nvidia |
Pouya, Soha | Stanford University |
Keywords: Vision-Based Navigation, Learning from Demonstration, Probabilistic Inference
Abstract: General-purpose navigation in challenging environments remains a significant problem in robotics, with current state-of-the-art approaches facing myriad limitations. Classical approaches struggle with cluttered settings and require extensive tuning, while learning-based methods face difficulties generalizing to out-of-distribution environments. This paper introduces X-Mobility, an end-to-end generalizable navigation model that overcomes existing challenges by leveraging three key ideas. First, X-Mobility employs an auto-regressive world modeling architecture with a latent state space to capture world dynamics. Second, a diverse set of multi-head decoders enables the model to learn a rich state representation that correlates strongly with effective navigation skills. Third, by decoupling world modeling from action policy, our architecture can train effectively on a variety of data sources, both with and without expert policies: off-policy data allows the model to learn world dynamics, while on-policy data with supervisory control enables optimal action policy learning. Through extensive experiments, we demonstrate that X-Mobility not only generalizes effectively but also surpasses current state-of-the-art navigation approaches. Additionally, X-Mobility also achieves zero-shot Sim2Real transferability and shows strong potential for cross-embodiment generalization. Project page: https://nvlabs.github.io/X-MOBILITY
|
|
11:35-11:40, Paper WeCT6.5 | |
Map-SemNav: Advancing Zero-Shot Continuous Vision-And-Language Navigation through Visual Semantics and Map Integration |
|
Wu, Shuai | Tianjin University |
Liu, Ruonan | Shanghai Jiao Tong University |
Xie, Zongxia | Tianjin University |
Pang, Zhibo | KTH Royal Institute of Technology |
Keywords: Vision-Based Navigation, Agent-Based Systems, Autonomous Agents
Abstract: This paper explores zero-shot Vision-and-Language Navigation (VLN), enabling agents to generalize navigation to unseen data classes. Most current approaches rely on large models, but these are not specifically tailored for VLN, lacking direct learning from navigation environments and slowing down agents due to their overwhelming size. To tackle this, we propose Map-Semantic Zero-shot Navigation (Map-SemNav), which does not rely on large models for navigation planning. Map-SemNav utilizes three key cues: direction, object, and scene, to acquire relational knowledge instead of memorizing specific classes, which enables generalization to unseen data. Direction is guided by a top-down semantic map, while object and scene information is decoupled from environment knowledge. Extensive experiments demonstrate that Map-SemNav outperforms state-of-the-art large model-based methods in zero-shot VLN tasks within continuous environments, while also offering higher efficiency due to its simplified architecture.
|
|
11:40-11:45, Paper WeCT6.6 | |
Safer Gap: Safe Navigation of Planar Nonholonomic Robots with a Gap-Based Local Planner |
|
Feng, Shiyu | Georgia Institute of Technology |
Abuaish, Ahmad | Georgia Institute of Technology |
Vela, Patricio | Georgia Institute of Technology |
Keywords: Vision-Based Navigation, Collision Avoidance, Reactive and Sensor-Based Planning
Abstract: This paper extends the gap-based navigation technique Potential Gap with safety guarantees at the local planning level for a kinematic planar nonholonomic robot model, leading to Safer Gap. It relies on a subset of navigable free space from the robot to a gap, denoted the keyhole region. The region is defined by the union of the largest collision-free disc centered on the robot and a collision-free trapezoidal region directed through the gap. Safer Gap first generates Bezier-based collision-free paths within the keyhole regions. The keyhole region of the top scoring path is encoded by a shallow neural network-based zeroing barrier function (ZBF) synthesized in real-time. Nonlinear Model Predictive Control (NMPC) with Keyhole ZBF constraints and output tracking of the Bezier path, synthesizes a safe kinematically feasible trajectory. The Potential Gap projection operator serves as a last action to enforce safety if the NMPC optimization fails to converge to a solution within the prescribed time. Simulation and experimental validation of Safer Gap confirm its collision-free navigation properties.
|
|
WeCT7 |
309 |
Marine Robotics 4 |
Regular Session |
|
11:15-11:20, Paper WeCT7.1 | |
Bathymetric Surveying with Imaging Sonar Using Neural Volume Rendering |
|
Xie, Yiping | Linköping University |
Troni, Giancarlo | Monterey Bay Aquarium Research Institute |
Bore, Nils | KTH Royal Institute of Technology |
Folkesson, John | KTH |
Keywords: Marine Robotics, Mapping, Deep Learning Methods
Abstract: This research addresses the challenge of estimating bathymetry from imaging sonars where the state-of-the-art works have primarily relied on either supervised learning with ground-truth labels or surface rendering based on the Lambertian assumption. In this letter, we propose a novel, self-supervised framework based on volume rendering for reconstructing bathymetry using forward-looking sonar (FLS) data collected during standard surveys. We represent the seafloor as a neural heightmap encapsulated with a parametric multi-resolution hash encoding scheme and model the sonar measurements with a differentiable renderer using sonar volumetric rendering employed with hierarchical sampling techniques. Additionally, we model the horizontal and vertical beam patterns and estimate them jointly with the bathymetry. We evaluate the proposed method quantitatively on simulation and field data collected by remotely operated vehicles (ROVs) during low-altitude surveys. Results show that the proposed method outperforms the current state-of-the-art approaches that use imaging sonars for seabed mapping. We also demonstrate that the proposed approach can potentially be used to increase the resolution of a low-resolution prior map with FLS data from low-altitude surveys.
|
|
11:20-11:25, Paper WeCT7.2 | |
Diver to Robot Communication Underwater |
|
Codd-Downey, Robert | York University |
Jenkin, Michael | York University |
Keywords: Marine Robotics, Gesture, Posture and Facial Expressions, Human-Robot Collaboration
Abstract: Gesture-based communication is a standard underwater communication strategy that is taught to divers as part of their regular diver training, and it would seem a natural mechanism to leverage for diver-to-robot communication underwater. Enabling an unmanned underwater vehicle (UUV) to understand such sequences would involve having the robot learn the large set of gestures that divers use and the way they are combined. As perfect transcription of gestures is unlikely, the communication process also requires an error-correcting framework to ensure that communication is clear and correct. Here we describe an interactive process that provides this infrastructure. A weakly supervised transfer learning approach is used to recognize standard SCUBA gestures in individual video frames and within a Sim2Real process to train an LSTM to recognize gesture sequences. This process is placed within a per-gesture and per-sequence interaction process to assist and confirm the recognition of individual gestures and to confirm entire gesture sequences. Individual aspects of this process and complete end-to-end operation are demonstrated using an unmanned underwater vehicle.
|
|
11:25-11:30, Paper WeCT7.3 | |
SIMP: Energy and Time-Efficient Real-Time 3D Motion Planning for Bio-Inspired AUVs |
|
Bjørlo, August Sletnes | NTNU |
Xanthidis, Marios | SINTEF Ocean |
Føre, Martin | NTNU |
Kelasidi, Eleni | NTNU |
Keywords: Marine Robotics, Collision Avoidance, Biologically-Inspired Robots
Abstract: Underwater navigation is an area of increasing research interest due to its fundamental complexity and industrial applications. However, due to convenience and the current state of theoretical understanding, the vast majority of underwater platforms use thrusters, while other forms of propulsion, such as undulatory locomotion, have received limited attention. This paper presents SIMP, the first real-time motion planning framework that produces energy- and time-efficient paths with empirical local optimality for articulated swimming robots in 3D. SIMP utilizes learned associations between parameterized, dynamically feasible undulatory gaits and their expected energy cost, velocity, and swept-out volume of the robot during execution, to formulate a simplified optimization problem that decides the path to be followed with the corresponding consecutive gaits and navigates the robot safely in complex 3D environments. The proposed pipeline is tested in numerical experiments with realistic dynamics for a 10-link underwater snake robot (USR) with anguilliform gaits, in simulated cluttered environments of significant challenge, displaying real-time replanning performance of more than 1 Hz.
|
|
11:30-11:35, Paper WeCT7.4 | |
End-To-End Underwater Multi-View Stereo for Dense Scene Reconstruction |
|
Yang, Guidong | The Chinese University of Hong Kong |
Wen, Junjie | The Chinese University of Hong Kong |
Zhao, Benyun | The Chinese University of Hong Kong |
Li, Qingxiang | The Chinese University of Hong Kong |
Huang, Yijun | The Chinese University of Hong Kong |
Lei, Lei | City University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Lam, Alan Hiu-Fung | The Chinese University of Hong Kong, |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Marine Robotics, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Recent advancements in learning-based multi-view stereo (MVS) have demonstrated significant improvements over their traditional counterparts, primarily due to the extensive availability of multi-view training images with ground-truth metric depths in the terrestrial in-air domain. However, underwater multi-view stereo (UwMVS) faces substantial challenges arising from the domain gap between in-air and underwater environments, leading to degraded performance when applying in-air MVS models to underwater scenarios. Furthermore, the progress of learning-based UwMVS methods has been hindered by the scarcity of underwater multi-view images with ground-truth depth maps and point clouds. In this paper, we address these challenges by introducing a physically-guided approach for synthesizing underwater multi-view images and present the first large-scale UwMVS dataset for end-to-end training and evaluation of learning-based UwMVS methods. Furthermore, we propose a novel UwMVS network that enhances geometric cue encoding to achieve more accurate and complete point cloud reconstruction. Extensive experiments on our dataset and real-world underwater scenes demonstrate that our dataset enables training models for dense underwater reconstruction and that our method achieves state-of-the-art performance in underwater reconstruction. Dataset, code and appendix are available at: https://cuhk-usr-group.github.io/UwMVS/
|
|
11:35-11:40, Paper WeCT7.5 | |
UR-MVO: Robust Monocular Visual Odometry for Underwater Scenarios |
|
Barhoum, Zein Alabedeen | ITMO University |
Maalla, Yazan | ITMO University |
Daher, Sulieman | ITMO University |
Topolnitskii, Alexander | ITMO University |
Mahmoud, Jaafar | ITMO University |
Kolyubin, Sergey | ITMO University |
Keywords: Marine Robotics, Localization, Object Detection, Segmentation and Categorization
Abstract: Visual odometry (VO) in underwater environments presents significant challenges due to poor visibility and dynamic scene changes, which render conventional (in-air) VO solutions unsuitable for underwater applications. We propose an underwater robust monocular visual odometry (UR-MVO) pipeline tailored for underwater scenarios, with feature extraction and matching based on the SuperPoint and SuperGlue models, respectively. We enhance the robustness of the feature extractor through field-specific fine-tuning of the SuperPoint model using few-shot unsupervised learning. This tuning was done on real images of underwater scenes to improve performance under harsh underwater imaging conditions. Moreover, we integrate semantic segmentation trained on underwater images into our pipeline to eliminate unreliable features belonging to dynamic objects and background. We evaluated the proposed solution on the Aqualoc dataset, demonstrating higher localization accuracy than other state-of-the-art direct and feature-based monocular VO methods such as DSO and SVO, and obtaining very competitive results compared to more resource-intensive monocular VSLAM approaches with loop closure, such as LDSO, UVS, and ORB-SLAM. The results show the high potential of our approach for further applications in underwater exploration and mapping using affordable sensory setups. We publish the code for the benefit of the community at https://github.com/be2rlab/UR-MVO
|
|
11:40-11:45, Paper WeCT7.6 | |
SeaSplat: Representing Underwater Scenes with 3D Gaussian Splatting and a Physically Grounded Image Formation Model |
|
Yang, Daniel | Massachusetts Institute of Technology |
Leonard, John | MIT |
Girdhar, Yogesh | Woods Hole Oceanographic Institution |
Keywords: Marine Robotics, Representation Learning, Deep Learning for Visual Perception
Abstract: We introduce SeaSplat, a method to enable real-time rendering of underwater scenes leveraging recent advances in 3D radiance fields. Underwater scenes are challenging visual environments, as rendering through a medium such as water introduces both range- and color-dependent effects on image capture. We constrain 3D Gaussian Splatting (3DGS), a recent advance in radiance fields enabling rapid training and real-time rendering of full 3D scenes, with a physically grounded underwater image formation model. Applying SeaSplat to real-world scenes from the SeaThru-NeRF dataset, collected by an underwater vehicle in the US Virgin Islands, and to simulation-degraded real-world scenes, we not only see increased quantitative performance when rendering novel viewpoints from the scene with the medium present, but are also able to recover the underlying true color of the scene and restore renders as if the intervening medium were absent. We show that the underwater image formation model helps learn scene structure, yielding better depth maps, and that our improvements maintain the significant computational advantages afforded by leveraging a 3D Gaussian representation.
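The "physically grounded image formation model" in this abstract refers to the standard per-channel attenuation-plus-backscatter description of underwater imaging. The sketch below applies such a model to a rendered color image and range map and then inverts it to recover medium-free color; the coefficients are made-up placeholders and the exact parameterization used in SeaSplat is defined in the paper.

```python
# Sketch of a per-channel underwater image formation model: direct signal is
# attenuated with range and a backscatter term is added. J is the medium-free
# color a radiance field would render, z the per-pixel range in meters.
import numpy as np

beta_D = np.array([0.40, 0.12, 0.08])   # direct-signal attenuation per channel (assumed)
beta_B = np.array([0.35, 0.10, 0.07])   # backscatter coefficient per channel (assumed)
B_inf  = np.array([0.05, 0.25, 0.35])   # veiling-light color (assumed)

def underwater_observe(J, z):
    """Attenuate the true color J (H,W,3) with range z (H,W) and add backscatter."""
    att = np.exp(-beta_D[None, None, :] * z[..., None])
    back = B_inf[None, None, :] * (1.0 - np.exp(-beta_B[None, None, :] * z[..., None]))
    return J * att + back

def restore_true_color(I, z):
    """Invert the model to recover J from an observed image and known range."""
    back = B_inf[None, None, :] * (1.0 - np.exp(-beta_B[None, None, :] * z[..., None]))
    return (I - back) * np.exp(beta_D[None, None, :] * z[..., None])

J = np.random.rand(4, 4, 3)                         # toy "true" scene colors
z = np.full((4, 4), 3.0)                            # 3 m range everywhere
I = underwater_observe(J, z)
print(np.allclose(restore_true_color(I, z), J))     # True: inversion is exact here
```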
|
|
WeCT8 |
311 |
Planning and Control for Legged Robots 2 |
Regular Session |
Co-Chair: Lin, Pei-Chun | National Taiwan University |
|
11:15-11:20, Paper WeCT8.1 | |
ProNav: Proprioceptive Traversability Estimation for Legged Robot Navigation in Outdoor Environments |
|
Elnoor, Mohamed | University of Maryland |
Sathyamoorthy, Adarsh Jagan | University of Maryland |
Kulathun Mudiyanselage, Kasun Weerakoon | University of Maryland, College Park |
Manocha, Dinesh | University of Maryland |
Keywords: Motion and Path Planning, Vision-Based Navigation, Perception-Action Coupling
Abstract: We propose a novel method, ProNav, which uses proprioceptive signals for traversability estimation in challenging outdoor terrains for autonomous legged robot navigation. Our approach uses sensor data from a legged robot’s joint encoders, force, and current sensors to measure the joint positions, forces, and current consumption, respectively, to accurately assess a terrain’s stability, resistance to the robot’s motion, risk of entrapment, and crash. Based on these factors, we compute the appropriate robot gait to maximize stability, which leads to reduced energy consumption. Our approach can also be used to predict imminent crashes in challenging terrains and execute behaviors to preemptively avoid them. We integrate ProNav with an exteroceptive-based method to navigate real-world environments with dense vegetation, high granularity, negative obstacles, etc. Our method shows an improvement of up to 40% in success rate and a reduction of up to 15.1% in energy consumption compared to exteroceptive-based methods.
|
|
11:20-11:25, Paper WeCT8.2 | |
MOVE: Multi-Skill Omnidirectional Legged Locomotion with Limited View in 3D Environments |
|
Li, Songbo | Zhejiang University |
Luo, Shixin | Zhejiang University |
Wu, Jun | Zhejiang University |
Zhu, Qiuguo | Zhejiang University |
Keywords: Legged Robots, Machine Learning for Robot Control, Deep Learning for Visual Perception
Abstract: Legged robots possess inherent advantages in traversing complex 3D terrains. However, previous work on low-cost quadruped robots with egocentric vision systems has been limited by a narrow front-facing view and exteroceptive noise, restricting omnidirectional mobility in such environments. While building a voxel map through a hierarchical structure can refine exteroception processing, it introduces significant computational overhead, noise, and delays. In this paper, we present MOVE, a one-stage end-to-end learning framework capable of multi-skill omnidirectional legged locomotion with limited view in 3D environments, just like what a real animal can do. When movement aligns with the robot's line of sight, exteroceptive perception enhances locomotion, enabling extreme climbing and leaping. When vision is obstructed or the direction of movement lies outside the robot's field of view, the robot relies on proprioception for tasks like crawling and climbing stairs. We integrate all these skills into a single neural network by introducing a pseudo-siamese network structure combining supervised and contrastive learning which helps the robot infer its surroundings beyond its field of view. Experiments in both simulations and real-world scenarios demonstrate the robustness of our method, broadening the operational environments for robotics with egocentric vision.
|
|
11:25-11:30, Paper WeCT8.3 | |
Generating Diverse Challenging Terrains for Legged Robots Using Quality-Diversity Algorithm |
|
Esquerre-Pourtère, Arthur | Seoul National University |
Kim, Minsoo | Graduate School of Convergence Science and Technology, Seoul Nat |
Park, Jaeheung | Seoul National University |
Keywords: Legged Robots, Evolutionary Robotics, Failure Detection and Recovery
Abstract: While legged robots have achieved significant advancements in recent years, ensuring the robustness of their controllers on unstructured terrains remains challenging. It requires generating diverse and challenging unstructured terrains to test the robot and discover its vulnerabilities. This topic remains underexplored in the literature. This paper presents a Quality-Diversity framework to generate diverse and challenging terrains that uncover weaknesses in legged robot controllers. Our method, applied to both simulated bipedal and quadruped robots, produces an archive of terrains optimized to challenge the controller in different ways. Quantitative and qualitative analyses show that the generated archive effectively contains terrains that the robots struggled to traverse, presenting different failure modes. Interesting results were observed, including failure cases that were not necessarily expected. Experiments show that the generated terrains can also be used to improve RL-based controllers.
|
|
11:30-11:35, Paper WeCT8.4 | |
Added Mass and Accuracy of the FF-SLIP Model for Legged Swimming |
|
Austin, Max | The University of Tokyo |
Ma, Linna | Florida State University |
Vasquez, Derek A. | Florida State University |
Van Stratum, Brian | Florida State University |
Clark, Jonathan | Florida State University |
Keywords: Legged Robots, Biologically-Inspired Robots, Biomimetics
Abstract: This paper presents the addition of two models for added mass to the Fluid-Field Spring-Loaded Inverted Pendulum (FF-SLIP) Model for legged swimming. The relative ability of these models to capture the increased fluid forces due to virtual mass displacement is evaluated using a two-legged swimming robot, Tadpole. We show that a simple addition to our reduced-order model can predict fluid-leg interaction forces while remaining computationally efficient.
|
|
11:35-11:40, Paper WeCT8.5 | |
A Virtual Gravity Controller for Efficient Underactuated Biped Robots |
|
Maligianni, Despoina | National Technical University of Athens |
Valouxis, Fotios | National Technical University of Athens |
Kantounias, Antonios | National Technical University of Athens |
Smyrli, Aikaterini | National Technical University of Athens, Athena Research Center |
Papadopoulos, Evangelos | National Technical University of Athens |
Keywords: Passive Walking, Underactuated Robots, Humanoid and Bipedal Locomotion
Abstract: This paper introduces a virtual gravity controller for underactuated biped robots. A bio-inspired model of passive bipedal walking is used as the basis for the controller's design. An analytical expression of the controller is obtained, allowing on-line implementations of the developed control scheme. Following a design modification tailored to the controller, the robot is able to reproduce its passive gait even on level-ground. The results are verified via independent high-fidelity physics simulations of the real robot's digital twin. The active robot demonstrates significant dynamic convergence to the passive model's dynamics, with only minor motorization efforts. The developed control scheme showcases robustness and energetic efficiency, and leads the way to a design-oriented approach in active biped locomotion.
|
|
11:40-11:45, Paper WeCT8.6 | |
Stair Climbing of a Transformable Robot Using Varying Leg-Wheel Contact Points |
|
Lai, Yen-Li | National Taiwan University |
Yu, Wei-Shun | National Taiwan University |
Lin, Pei-Chun | National Taiwan University |
Keywords: Legged Robots, Motion Control, Wheeled Robots
Abstract: Staircases are a challenging terrain frequently encountered in urban environments. While leg-wheel robots take advantage of having both legged and wheeled modes, their ability to negotiate stairs still requires careful planning. This paper presents a novel approach to developing a stair-climbing behavior for leg-wheel transformable robots. A comprehensive stair-climbing strategy is constructed by analyzing the workspace of the leg-wheel mechanism, considering the position of the robot’s center of mass, and accounting for foothold displacement owing to the possible leg-wheel forward rolling motion. This strategy enables the robot to safely navigate stairs using its leg-wheel's appropriate parts. Stability during transitions between steps is ensured, and an optimized swing trajectory is proposed to minimize slippage and impact. The approach is validated through simulations and further tested experimentally on staircases with treads of 27 cm and risers of 12 cm, as well as staircases with treads of 24 cm and risers of 14 cm. The experimental results demonstrate the effectiveness and robustness of the proposed method.
|
|
WeCT9 |
312 |
Geometric Foundations |
Regular Session |
Chair: Barfoot, Timothy | University of Toronto |
Co-Chair: Ge, Qiaode | Stony Brook University |
|
11:15-11:20, Paper WeCT9.1 | |
Marginalizing and Conditioning Gaussians Onto Linear Approximations of Smooth Manifolds with Applications in Robotics |
|
Guo, Zi Cong | University of Toronto |
Forbes, James Richard | McGill University |
Barfoot, Timothy | University of Toronto |
Keywords: Probability and Statistical Methods, SLAM, Probabilistic Inference
Abstract: We present closed-form expressions for marginalizing and conditioning Gaussians onto linear manifolds, and demonstrate how to apply these expressions to smooth nonlinear manifolds through linearization. Although marginalization and conditioning onto axis-aligned manifolds are well-established procedures, doing so onto non-axis-aligned manifolds is not as well understood. We demonstrate the utility of our expressions through three applications: 1) approximation of the projected normal distribution, where the quality of our linearized approximation increases as problem nonlinearity decreases; 2) covariance extraction in Koopman SLAM, where our covariances are shown to be consistent on a real-world dataset; and 3) covariance extraction in constrained GTSAM, where our covariances are shown to be consistent in simulation.
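As a point of reference for this abstract, the snippet below shows the textbook special case of conditioning a Gaussian onto a non-axis-aligned linear manifold {x : Ax = b}; the paper's contribution is the general closed forms and their linearized application to smooth manifolds, which this sketch does not reproduce.

```python
# Condition N(mu, Sigma) on the linear constraint A x = b (a non-axis-aligned
# linear manifold). Standard Gaussian conditioning formulas, shown for intuition only.
import numpy as np

def condition_on_linear_manifold(mu, Sigma, A, b):
    S = A @ Sigma @ A.T                         # innovation covariance
    K = Sigma @ A.T @ np.linalg.inv(S)          # gain
    mu_c = mu + K @ (b - A @ mu)                # conditional mean lies on A x = b
    Sigma_c = Sigma - K @ A @ Sigma             # conditional covariance (rank-deficient along A)
    return mu_c, Sigma_c

mu = np.array([1.0, 2.0, 0.5])
Sigma = np.diag([0.5, 0.2, 0.1])
A = np.array([[1.0, -1.0, 0.0]])                # constrain x0 - x1 = 0.3
b = np.array([0.3])
mu_c, Sigma_c = condition_on_linear_manifold(mu, Sigma, A, b)
print(A @ mu_c)                                 # ~[0.3]: the conditioned mean satisfies the constraint
```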
|
|
11:20-11:25, Paper WeCT9.2 | |
"Hierarchy of Needs" for Robots: Control Synthesis for Compositions of Hierarchical, Complex Objectives |
|
Lin, Ruoyu | University of California, Irvine |
Egerstedt, Magnus | University of California, Irvine |
Keywords: Hybrid Logical/Dynamical Planning and Verification, Robot Safety, Integrated Planning and Control
Abstract: Drawing inspiration from Maslow's "hierarchy of needs", this paper develops a real-time control synthesis framework for robots to address hierarchical, complex objectives, recognizing that their behaviors are inherently driven by underlying needs. Each need is encoded by the zero-superlevel set of a control barrier function (CBF), which can be time-varying, and all the needs at the same level in a hierarchy are composed into a single one through Boolean compositions of the corresponding CBFs. The effectiveness of the proposed framework is demonstrated through a hypothetical interstellar exploration mission using laboratory robots, and novel results on nonsmooth CBF and time-varying CBF are derived.
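One common way to realize the Boolean "AND" composition of barrier functions mentioned in this abstract is a smooth under-approximation of the minimum; the toy snippet below illustrates only that idea, not the paper's nonsmooth and time-varying treatment.

```python
# Smooth AND of two "needs" h1 >= 0 and h2 >= 0 via a log-sum-exp soft minimum,
# which under-approximates min(h1, h2) and approaches it as kappa grows.
import numpy as np

def smooth_and(h_values, kappa=10.0):
    """Soft minimum: always <= min(h_values)."""
    h = np.asarray(h_values, dtype=float)
    return -np.log(np.sum(np.exp(-kappa * h))) / kappa

h_stay_charged = 0.4      # e.g. battery margin (assumed toy value)
h_avoid_hazard = 0.1      # e.g. distance margin to a hazard (assumed toy value)
print(smooth_and([h_stay_charged, h_avoid_hazard]), min(h_stay_charged, h_avoid_hazard))
```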
|
|
11:25-11:30, Paper WeCT9.3 | |
RM4D: A Combined Reachability and Inverse Reachability Map for Common 6-/7-Axis Robot Arms by Dimensionality Reduction to 4D |
|
Rudorfer, Martin | Aston University |
Keywords: Kinematics, Mobile Manipulation, Industrial Robots
Abstract: Knowledge of a manipulator’s workspace is fundamental for a variety of tasks including robot design, grasp planning and robot base placement. Consequently, workspace representations are well studied in robotics. Two important representations are reachability maps and inverse reachability maps. The former predicts whether a given end-effector pose is reachable from where the robot currently is, and the latter suggests suitable base positions for a desired end-effector pose. Typically, the reachability map is built by discretizing the 6D space containing the robot’s workspace and determining, for each cell, whether it is reachable or not. The reachability map is subsequently inverted to build the inverse map. This is a cumbersome process which restricts the applications of such maps. In this work, we exploit commonalities of existing six- and seven-axis robot arms to reduce the dimension of the discretization from 6D to 4D. We propose Reachability Map 4D (RM4D), a map that only requires a single 4D data structure for both forward and inverse queries. This gives a much more compact map that can be constructed an order of magnitude faster than existing maps, with no inversion overhead and no loss in accuracy. Finally, we showcase the efficiency gains by applying RM4D to find suitable base positions in a scenario with 800 target grasps.
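To make the idea of a single forward/inverse structure concrete, here is a generic sketch of a binary reachability grid over a reduced 4D index. The specific four coordinates used by RM4D are defined in the paper; the reduction in `pose_to_key` below (yaw symmetry about the base axis plus an approach-tilt angle) is an illustrative assumption of this sketch, not the paper's construction.

```python
# Generic 4D reachability grid: mark poses reachable during forward-kinematics
# sampling, then answer forward queries (and, with the same grid, inverse-style
# queries about relative base placement) by hashing a pose to a 4D cell.
import numpy as np

class ReachMap4D:
    def __init__(self, bins=(40, 40, 36, 36),
                 lows=(0.0, -1.0, 0.0, 0.0), highs=(1.5, 1.5, np.pi, np.pi)):
        self.grid = np.zeros(bins, dtype=bool)
        self.lows, self.highs, self.bins = np.array(lows), np.array(highs), np.array(bins)

    def pose_to_key(self, p_ee, z_axis_ee):
        r = np.hypot(p_ee[0], p_ee[1])                    # radial distance from the base axis
        h = p_ee[2]                                       # end-effector height
        tilt = np.arccos(np.clip(z_axis_ee[2], -1, 1))    # approach tilt w.r.t. vertical
        # relative angle between radial direction and horizontal approach direction
        yaw_rel = np.arctan2(z_axis_ee[1], z_axis_ee[0]) - np.arctan2(p_ee[1], p_ee[0])
        feat = np.array([r, h, tilt, abs(np.arctan2(np.sin(yaw_rel), np.cos(yaw_rel)))])
        idx = ((feat - self.lows) / (self.highs - self.lows) * self.bins).astype(int)
        return tuple(np.clip(idx, 0, self.bins - 1))

    def mark_reachable(self, p_ee, z_axis_ee):
        self.grid[self.pose_to_key(p_ee, z_axis_ee)] = True

    def is_reachable(self, p_ee, z_axis_ee):
        return bool(self.grid[self.pose_to_key(p_ee, z_axis_ee)])
```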
|
|
11:30-11:35, Paper WeCT9.4 | |
An Average-Distance Minimizing Motion Sweep for Bounded Spatial Objects and Its Application in Bézier-Like Freeform Motion Generation |
|
Liu, Huan | Stony Brook University, SUNY |
Ge, Qiaode | Stony Brook University |
Keywords: Kinematics, Motion and Path Planning, Motion Control
Abstract: This paper uses the ellipsoidal parameters associated with volume moments of inertia of a bounded solid object to construct a motion sweep joining two poses of the solid object, in contrast to earlier works on motion interpolation in SE(3) without taking into account the shape of the moving object. The paper borrows the concept of shape-dependent object norms introduced by Kazerounian and Rastegar and refined by Chirikjian and Zhou to compute as a metric the average of the squared distances (or ASD) among all homologous points of the bounded body between two given poses and seeks to obtain an optimal interpolating motion that minimizes a combination of two ASD distances from each intermediate pose to the two given poses. It is found that the ASD minimizing motion sweep is a novel straight-line motion such that while the centroid of the object follows a straight line, the orientation of the object is constrained so that the ASD metric is minimized. Furthermore, the rotational component can be determined by polar decomposition of the linearly interpolated rotation matrices, scaled by the object's inertia parameters. As an illustration of one of its applications, this motion sweep is repeatedly applied using the de Casteljau algorithm to generate Bézier-like freeform motions, whose paths are in general dependent on the shape of the inertia ellipsoid.
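The polar-decomposition step mentioned at the end of this abstract can be illustrated in a few lines of NumPy: linearly blend two rotation matrices (here weighted by a diagonal inertia factor that merely stands in for the paper's exact inertia scaling) and project the blend back onto SO(3) via an SVD-based polar decomposition.

```python
# Inertia-weighted linear blend of two rotations, projected back to SO(3) by the
# polar factor of the blend. The diagonal weighting J is an assumed placeholder.
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def polar_rotation(M):
    """Closest rotation to M in the Frobenius sense (polar factor via SVD)."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # enforce a proper rotation
        U[:, -1] *= -1
        R = U @ Vt
    return R

R0, R1 = rot_z(0.0), rot_z(np.pi / 2)
J = np.diag([2.0, 1.0, 0.5])          # assumed inertia-ellipsoid weighting
for t in np.linspace(0, 1, 5):
    blend = ((1 - t) * R0 + t * R1) @ J
    R_t = polar_rotation(blend)
    print(t, np.round(R_t @ R_t.T, 6)[0, 0])   # each R_t is orthogonal
```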
|
|
11:35-11:40, Paper WeCT9.5 | |
Geometric Static Modeling Framework for Piecewise-Continuous Curved-Link Multi Point-Of-Contact Tensegrity Robots |
|
Ervin, Lauren | University of Alabama |
Vikas, Vishesh | University of Alabama |
Keywords: Kinematics, Space Robotics and Automation
Abstract: Tensegrities synergistically combine tensile (cable) and rigid (link) elements to achieve structural integrity, making them lightweight, packable, and impact resistant. Consequently, they have high potential for locomotion in unstructured environments. This research presents geometric modeling of a Tensegrity eXploratory Robot (TeXploR) comprised of two semi-circular, curved links held together by 12 prestressed cables and actuated with an internal mass shifting along each link. This design allows for efficient rolling with stability (e.g., tip-over on an incline). However, the unique design poses static and dynamic modeling challenges given the discontinuous nature of the semi-circular, curved links, two changing points of contact with the surface plane, and instantaneous movement of the masses along the links. The robot is modeled using a geometric approach where the holonomic constraints confirm the experimentally observed four-state hybrid system, proving TeXploR rolls along one link while pivoting about the end of the other. It also identifies the quasi-static state transition boundaries that enable a continuous change in the robot states via internal mass shifting. This is the first time in literature a non-spherical two-point contact system is kinematically and geometrically modeled. Furthermore, the static solutions are closed-form and do not require numerical exploration of the solution. The MATLAB® simulations are experimentally validated on a tetherless prototype with mean absolute error of 4.36° for the arc angles of the points of contact.
|
|
11:40-11:45, Paper WeCT9.6 | |
GISR: Geometric Initialization and Silhouette-Based Refinement for Single-View Robot Pose and Configuration Estimation |
|
Bilic, Ivan | University of Zagreb |
Maric, Filip | University of Toronto Institute for Aerospace Studies |
Bonsignorio, Fabio | FER, University of Zagreb |
Petrovic, Ivan | University of Zagreb |
Keywords: Deep Learning for Visual Perception, Visual Learning, AI-Enabled Robotics
Abstract: In autonomous robotics, measurement of the robot’s internal state and perception of its environment, including interaction with other agents such as collaborative robots, are essential. Estimating the pose of the robot arm from a single view has the potential to replace classical eye-to-hand calibration approaches and is particularly attractive for online estimation and dynamic environments. In addition to its pose, recovering the robot configuration provides a complete spatial understanding of the observed robot that can be used to anticipate the actions of other agents in advanced robotics use cases. Furthermore, this additional redundancy enables the planning and execution of recovery protocols in case of sensor failures or external disturbances. We introduce GISR - a deep configuration and robot-to-camera pose estimation method that prioritizes execution in real-time. GISR consists of two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) a deep iterative silhouette-based refinement module that arrives at a final solution in just a few iterations. We evaluate GISR on publicly available data and show that it outperforms existing methods of the same class in terms of both speed and accuracy, and can compete with approaches that rely on ground-truth proprioception and recover only the pose. Our code will be available at https://github.com/iwhitey/GISR-robot.
|
|
WeCT10 |
313 |
Multi-Robot Path Planning 3 |
Regular Session |
Chair: Hollinger, Geoffrey | Oregon State University |
Co-Chair: Yu, Jingjin | Rutgers University |
|
11:15-11:20, Paper WeCT10.1 | |
Stop-N-Go: Search-Based Conflict Resolution for Motion Planning of Multiple Robotic Manipulators |
|
Han, Gidon | Sogang University |
Park, Jeongwoo | Sogang University |
Nam, Changjoo | Sogang University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Cooperating Robots
Abstract: We address the motion planning problem for multiple robotic manipulators in packed environments, where a shared workspace can result in goal positions being occupied or blocked by other robots unless those robots move away to free them. While planning in a coupled configuration space (C-space) is straightforward, it struggles to scale with the number of robots and often fails to find solutions. Decoupled planning is faster but frequently leads to conflicts between trajectories. We propose a conflict resolution approach that inserts pauses into individually planned trajectories using an A* search strategy to minimize the makespan, i.e., the total time until all robots complete their tasks. This method allows some robots to stop, enabling others to move without collisions, and maintains short distances in the C-space. It also effectively handles cases where goal positions are initially blocked by other robots. Experimental results show that our method successfully solves challenging instances where baseline methods fail to find feasible solutions.
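As a toy illustration of search-based pause insertion, the sketch below runs a best-first search over per-robot start delays for point robots and returns the first conflict-free assignment ordered by makespan. The actual method operates on manipulator trajectories in configuration space and can insert pauses anywhere along a trajectory, so treat this only as a minimal analogue.

```python
# Best-first search over start-delay assignments (a simplified stand-in for
# inserting pauses), minimizing makespan while resolving pairwise conflicts.
import heapq
import numpy as np

def first_conflict(trajs, delays, radius):
    """First (t, i, j) where two robots get closer than `radius` after applying
    per-robot start delays (a delayed robot waits at its first waypoint)."""
    horizon = max(len(tr) + d for tr, d in zip(trajs, delays))
    for t in range(horizon):
        pts = [np.asarray(tr[min(max(t - d, 0), len(tr) - 1)]) for tr, d in zip(trajs, delays)]
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                if np.linalg.norm(pts[i] - pts[j]) < radius:
                    return t, i, j
    return None

def stop_n_go(trajs, radius, max_delay=50):
    """Best-first search over start delays, ordered by makespan."""
    start = tuple(0 for _ in trajs)
    frontier = [(max(len(tr) for tr in trajs), start)]
    seen = {start}
    while frontier:
        makespan, delays = heapq.heappop(frontier)
        conflict = first_conflict(trajs, delays, radius)
        if conflict is None:
            return delays, makespan                   # minimal-makespan conflict-free schedule found
        _, i, j = conflict
        for r in (i, j):                              # branch: delay either robot in the conflict
            nd = list(delays); nd[r] += 1
            nd = tuple(nd)
            if nd not in seen and nd[r] <= max_delay:
                seen.add(nd)
                heapq.heappush(frontier, (max(len(tr) + d for tr, d in zip(trajs, nd)), nd))
    return None, None

# two point robots whose paths cross at the origin: one must wait for the other
traj_a = [(x, 0.0) for x in np.linspace(-2, 2, 9)]
traj_b = [(0.0, y) for y in np.linspace(-2, 2, 9)]
print(stop_n_go([traj_a, traj_b], radius=0.6))        # e.g. ((0, 2), 11)
```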
|
|
11:20-11:25, Paper WeCT10.2 | |
Constrained Nonlinear Kaczmarz Projection on Intersections of Manifolds for Coordinated Multi-Robot Mobile Manipulation |
|
Agrawal, Akshaya | Oregon State University |
Mayer, Parker | Oregon State University |
Kingston, Zachary | Purdue University |
Hollinger, Geoffrey | Oregon State University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Constrained Motion Planning, Cooperating Robots
Abstract: Cooperative manipulation tasks impose various structure-, task-, and robot-specific constraints on mobile manipulators. However, current methods struggle to model and solve these myriad constraints simultaneously. We propose a twofold solution: first, we model constraints as a family of manifolds amenable to simultaneous solving. Second, we introduce the constrained nonlinear Kaczmarz (cNKZ) projection technique to produce constraint-satisfying solutions. Experiments show that cNKZ dramatically outperforms baseline approaches, which cannot find solutions at all. We integrate cNKZ with a sampling-based motion planning algorithm to generate complex, coordinated motions for 3--6 mobile manipulators (18--36 DoF), with cNKZ solving up to 80 nonlinear constraints simultaneously and achieving up to a 92% success rate in cluttered environments. We also demonstrate our approach on hardware using three Turtlebot3 Waffle Pi robots with OpenMANIPULATOR-X arms.
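For intuition on the projection step, here is the basic nonlinear Kaczmarz iteration on two toy constraint manifolds (a sphere and a plane); the paper's cNKZ variant adds the constraint handling needed for the full family of cooperative-manipulation constraints, which this sketch omits.

```python
# Basic nonlinear Kaczmarz sweep: cycle through constraints f_i(x) = 0 and take
# the minimum-norm Newton step onto each local linearization.
import numpy as np

def nonlinear_kaczmarz(x0, constraints, sweeps=50, tol=1e-10):
    """constraints: list of (f, J) with f: R^n -> R^m and J its Jacobian."""
    x = np.asarray(x0, dtype=float)
    for _ in range(sweeps):
        for f, J in constraints:
            r = np.atleast_1d(f(x))
            Jx = np.atleast_2d(J(x))
            x = x - Jx.T @ np.linalg.solve(Jx @ Jx.T, r)
        if all(np.linalg.norm(f(x)) < tol for f, _ in constraints):
            break
    return x

# toy manifolds: a sphere of radius 2 and the plane x + y + z = 1
sphere = (lambda x: np.array([x @ x - 4.0]), lambda x: 2.0 * x[None, :])
plane  = (lambda x: np.array([x.sum() - 1.0]), lambda x: np.ones((1, 3)))
x_star = nonlinear_kaczmarz(np.array([1.0, 0.5, -0.2]), [sphere, plane])
print(x_star, np.round([x_star @ x_star - 4, x_star.sum() - 1], 8))   # both residuals ~0
```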
|
|
11:25-11:30, Paper WeCT10.3 | |
Targeted Parallelization of Conflict-Based Search for Multi-Robot Path Planning |
|
Guo, Teng | Rutgers University |
Yu, Jingjin | Rutgers University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Motion and Path Planning
Abstract: Multi-Robot Path Planning (MRPP) on graphs, also known as Multi-Agent Path Finding (MAPF), is a well-established NP-hard problem with critically important applications. In (near-)optimally solving MRPP, as serial computation approaches its efficiency limits, parallelization offers a promising route to extend that limit further. As a single solution is unlikely to be successful in addressing all settings, e.g., in handling small/hard or large/sparse MRPP instances, in this study we explore a targeted parallelization effort to boost the performance of conflict-based search for MRPP. Specifically, when instances are relatively small but robots are densely packed with strong interactions, we devise a decentralized parallel algorithm that concurrently explores multiple branches, which leads to markedly enhanced solution discovery. On the other hand, for large problems with sparse robot-robot interactions, we find prioritizing node expansion and conflict resolution to be more promising. Our multi-threaded approach to parallelizing bounded-suboptimal conflict-based search algorithms demonstrates significant improvements over baseline serial methods in success rate or runtime. Our work furthers the understanding of MRPP and charts a promising path for elevating solution quality and computational efficiency through parallel algorithmic strategies.
|
|
11:30-11:35, Paper WeCT10.4 | |
Heuristically Guided Compilation for Task Assignment and Path Finding |
|
Chen, Zheng | Zhejiang University |
Chen, Changlin | University of Science and Technology of China |
Yiran, Ni | Zhejiang University |
Wang, Junhao | Hefei University of Technology |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Collision Avoidance, Multi-Robot Systems
Abstract: We investigate the Combined Target-Assignment and Path-Finding (TAPF) problem, which computes both task assignments and collision-free paths for multiple agents: each agent is required to select a target from an underlying set, and reaching it yields a payoff. There is also a cost closely related to the time required for each agent to reach its goal. The objective is to maximize the minimum gain generated by the agents. We propose a Compilation-Based Approach with Heuristics (TA-CBWH) to approximate the optimal solution, behind which are two critical ideas: (i) for a specific task assignment, we formulate an integer linear program (ILP) and iterate it with large neighborhood search (LNS) to quickly improve the solution quality to near-optimal; (ii) across distinct task assignments, a switching mechanism is developed to determine the most promising iteration while progressively eliminating unnecessary task assignments. Comparative experiments demonstrate that TA-CBWH outperforms a wide range of existing approaches across various maps and different numbers of agents.
|
|
11:35-11:40, Paper WeCT10.5 | |
Deploying Ten Thousand Robots: Scalable Imitation Learning for Lifelong Multi-Agent Path Finding |
|
Jiang, He | Carnegie Mellon University |
Wang, Yutong | National University of Singapore |
Veerapaneni, Rishi | Carnegie Mellon University |
Duhan, Tanishq Harish | National University of Singapore |
Sartoretti, Guillaume Adrien | National University of Singapore (NUS) |
Li, Jiaoyang | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Imitation Learning, Integrated Planning and Learning
Abstract: Lifelong Multi-Agent Path Finding (LMAPF) repeatedly finds collision-free paths for multiple agents that are continually assigned new goals when they reach current ones. Recently, this field has embraced learning-based methods, which reactively generate single-step actions based on individual local observations. However, it is still challenging for them to match the performance of the best search-based algorithms, especially in large-scale settings. This work proposes an imitation-learning-based LMAPF solver that introduces a novel communication module as well as systematic single-step collision resolution and global guidance techniques. Our proposed solver, Scalable Imitation Learning for LMAPF (SILLM), inherits the fast reasoning speed of learning-based methods and the high solution quality of search-based methods with the help of modern GPUs. Across six large-scale maps with up to 10,000 agents and varying obstacle structures, SILLM surpasses the best learning- and search-based baselines, achieving average throughput improvements of 137.7% and 16.0%, respectively. Furthermore, SILLM also beats the winning solution of the 2023 League of Robot Runners, an international LMAPF competition. Finally, we validated SILLM with 10 real robots and 100 virtual robots in a mock warehouse environment.
|
|
11:40-11:45, Paper WeCT10.6 | |
Safety-Guaranteed Distributed Formation Control of Multi-Robot Systems Over Graphs with Rigid and Elastic Edges |
|
Pham, Hoang | Tampere University |
Ranasinghe, Nadun | Tampere University |
Le, Dong | Tampere University |
Atman, Made Widhi Surya | Turku University |
Gusrialdi, Azwirman | Tampere University |
Keywords: Multi-Robot Systems, Collision Avoidance, Distributed Robot Systems
Abstract: This paper considers the problem of formation control of multi-robot systems represented by a graph featuring both rigid and elastic edges, capturing specified range tolerance to the desired inter-robot distances. The objective is to navigate the robots safely through unknown environments with obstacles, utilizing onboard sensors like LiDAR while maintaining inter-robot distance constraints. To this end, a novel cooperative control algorithm is proposed, employing quadratic programming and leveraging control barrier functions to integrate multiple control objectives seamlessly. This approach ensures a unified strategy and provides a safety certificate. Experimental validation of the proposed cooperative control algorithm is conducted using a robotic testbed.
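A minimal single-constraint version of the control-barrier-function quadratic program used here has a closed-form solution, sketched below for one single-integrator robot keeping a minimum distance to a neighbor; the paper composes many such constraints (rigid and elastic edges, LiDAR-detected obstacles) into one QP, which this sketch does not attempt.

```python
# Closed-form single-constraint CBF filter:
#   u* = argmin ||u - u_des||^2  s.t.  grad_h . u >= -alpha * h
import numpy as np

def cbf_filter(u_des, p_self, p_neigh, d_min, alpha=1.0):
    diff = p_self - p_neigh
    h = diff @ diff - d_min**2            # barrier: h >= 0  <=>  distance >= d_min
    grad_h = 2.0 * diff                   # gradient w.r.t. this robot's position
    slack = grad_h @ u_des + alpha * h
    if slack >= 0:                        # nominal command already safe
        return u_des
    return u_des - slack * grad_h / (grad_h @ grad_h)

u_des = np.array([-1.0, 0.0])             # nominal command drives toward the neighbor
u_safe = cbf_filter(u_des, p_self=np.array([0.6, 0.0]), p_neigh=np.zeros(2), d_min=0.5)
print(u_safe)                              # approach speed reduced so that h >= 0 is preserved
```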
|
|
WeCT11 |
314 |
Safe Control 2 |
Regular Session |
Chair: Bajcsy, Andrea | Carnegie Mellon University |
Co-Chair: Hu, Bin | University of Houston |
|
11:15-11:20, Paper WeCT11.1 | |
Safety-Critical Control with Saliency Detection for Mobile Robots in Dynamic Multi-Obstacle Environments |
|
Zhang, Yu | Technical University of Munich |
Wen, Long | Technical University of Munich |
Hong, Lin | Harbin Institute of Technology |
Zhang, Liding | Technical University of Munich |
Guo, Qun | Technische Universität München |
Li, Shixin | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: Robot Safety, Robust/Adaptive Control, RGB-D Perception
Abstract: This paper proposes a novel dual-filter architecture utilizing RGB-D camera data and dynamic control barrier functions (D-CBFs) for real-time obstacle avoidance in unstructured environments. The proposed method efficiently handles static, suddenly appearing, and dynamic obstacles, maintaining consistent computational performance across diverse scenarios. To achieve this, two key challenges must be addressed. First, the substantial volume of pixel and depth map data requires robust, real-time processing for efficient D-CBF construction. Second, constructing D-CBFs for each obstacle in multi-obstacle scenarios increases optimization solver time. To address these challenges, we adapt the concept of salient object detection (SOD), proposing an enhanced FastSOD (E-FastSOD) method for rapid risk area identification. This approach rapidly filters out low-risk areas, while high-risk regions are mathematically represented utilizing the proposed enhanced minimal bounding circle (E-MBC) technique. We differentiate static and dynamic obstacles by comparing current and previous MBC states, employing Kalman filtering for obstacle state prediction. This setup enables efficient online D-CBF construction for each MBC, balancing computational speed with accurate obstacle representation. Subsequently, the second filter establishes buffer zones around the constructed D-CBFs, activating only those corresponding to zones the robot actually enters, rather than all D-CBFs, thereby improving real-time performance. We prove the system's safety and asymptotic stabilization under this architecture. Simulated and real-world experiments validate our method, demonstrating an equipped mobile robot's ability to accomplish tasks while ensuring safety across diverse, unknown scenarios.
|
|
11:20-11:25, Paper WeCT11.2 | |
Safe Coverage for Heterogeneous Systems with Limited Connectivity |
|
Taylor, Annalisa T. | Northwestern University |
Berrueta, Thomas | Northwestern University |
Pinosky, Allison | Northwestern University |
Murphey, Todd | Northwestern University |
Keywords: Distributed Robot Systems, Robot Safety, Networked Robots
Abstract: An ongoing challenge for emergency deployments is operating multi-robot teams of diverse agents under communication constraints---where inter-agent connectivity is rare. Thus, heterogeneous systems must autonomously adapt to changing conditions while maintaining safety. In this work, we develop an algorithm for heterogeneous, decentralized multi-robot systems to independently manage safety constraints with provable guarantees for safety and communication for a coverage task. We demonstrate this algorithm in scenarios where up to 100 agents must navigate a simulated cluttered environment with safety constraints that change as agents observe hazards. Further, we show that the performance of a system with a largely disconnected network is equivalent to a fully connected communication network, suggesting that treating connectivity as a constraint may be unnecessary with an appropriate control strategy.
|
|
11:25-11:30, Paper WeCT11.3 | |
Safe Control of Quadruped in Varying Dynamics Via Safety Index Adaptation |
|
Yun, SirkHoo, Kai | Carnegie Mellon University |
Chen, Rui | Carnegie Mellon University; University of Michigan; |
Dunaway, Chase | New Mexico Institute of Mining and Technology |
Dolan, John M. | Carnegie Mellon University |
Liu, Changliu | Carnegie Mellon University |
Keywords: Robot Safety, Robust/Adaptive Control, Legged Robots
Abstract: Varying dynamics pose a fundamental difficulty when deploying safe control laws in the real world. Safety Index Synthesis (SIS) deeply relies on the system dynamics and once the dynamics change, the previously synthesized safety index becomes invalid. In this work, we show the real-time efficacy of Safety Index Adaptation (SIA) in varying dynamics. SIA enables real-time adaptation to the changing dynamics so that the adapted safe control law can still guarantee 1) forward invariance within a safe region and 2) finite time convergence to that safe region. This work employs SIA on a package-carrying quadruped robot, where the payload weight changes in real-time. SIA updates the safety index when the dynamics change, e.g., a change in payload weight, so that the quadruped can avoid obstacles while achieving its performance objectives. Numerical study provides theoretical guarantees for SIA and a series of hardware experiments demonstrate the effectiveness of SIA in real-world deployment in avoiding obstacles under varying dynamics.
|
|
11:30-11:35, Paper WeCT11.4 | |
Updating Robot Safety Representations Online from Natural Language Feedback |
|
Santos, Leonardo | Universidade Federal De Minas Gerais |
Li, Zirui | University of Rochester |
Peters, Lasse | Delft University of Technology |
Bansal, Somil | Stanford University |
Bajcsy, Andrea | Carnegie Mellon University |
Keywords: Robot Safety, AI-Enabled Robotics, Vision-Based Navigation
Abstract: Robots must operate safely when deployed in novel and human-centered environments, like homes. Current safe control approaches typically assume that the safety constraints are known a priori, and thus, the robot can pre-compute a corresponding safety controller. While this may make sense for some safety constraints (e.g., avoiding collision with walls by analyzing a floor plan), other constraints are more complex (e.g., spills), inherently personal, context-dependent, and can only be identified at deployment time when the robot is interacting in a specific environment and with a specific person (e.g., fragile objects, expensive rugs). Here, language provides a flexible mechanism to communicate these evolving safety constraints to the robot. In this work, we use vision language models (VLMs) to interpret language feedback and the robot’s image observations to continuously update the robot’s representation of safety constraints. With these inferred constraints, we update a Hamilton-Jacobi reachability safety controller to efficiently update the robot controller to ensure ongoing safety. Through simulation and hardware experiments, we demonstrate the robot’s ability to infer and respect language-based safety constraints with the proposed approach.
|
|
11:35-11:40, Paper WeCT11.5 | |
Detecting Perception-Based Attacks Using Visual Odometry: Inconsistency Modeling and Checking on Robotic States |
|
Xu, Yuan | Nanyang Technological University |
Deng, Gelei | Nanyang Technological University |
Zhang, Tianwei | Nanyang Technological University |
Keywords: Robot Safety
Abstract: Perception systems in robotic vehicles are crucial for safe and efficient operation, providing key state estimates necessary for planning and control. However, these systems are increasingly vulnerable to perception-based attacks, such as odometry spoofing, position spoofing, obstacle hiding, and object misclassification, which can lead to catastrophic failures. In this paper, we propose a novel approach to detect perception-based attacks by modeling inconsistencies between the physical and estimated states of the robot. Our approach offers a unified methodology for detecting different types of attacks with high accuracy and minimal computational overhead. We validate our method through extensive simulations and real-world scenarios, achieving a 99.5% success rate in detecting attacks, while maintaining a low latency (within 100ms).
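For readers who want the flavor of an inconsistency check like the one described above, the following minimal sketch (not the authors' detector; the unicycle model, threshold, and function names are illustrative assumptions) propagates the last trusted state through a nominal motion model and flags a visual-odometry estimate that deviates beyond a residual threshold.

```python
import numpy as np

def propagate(state, cmd, dt):
    """Nominal unicycle motion model: state = [x, y, yaw], cmd = [v, omega]."""
    x, y, yaw = state
    v, omega = cmd
    return np.array([x + v * np.cos(yaw) * dt,
                     y + v * np.sin(yaw) * dt,
                     yaw + omega * dt])

def inconsistency_check(prev_state, cmd, vo_state, dt, threshold=0.15):
    """Flag a perception-based attack when the visual-odometry estimate is
    inconsistent with the physically plausible predicted state."""
    predicted = propagate(prev_state, cmd, dt)
    residual = np.linalg.norm(predicted[:2] - vo_state[:2])
    return bool(residual > threshold), float(residual)

# Toy usage: a spoofed position jump triggers the check.
prev = np.array([0.0, 0.0, 0.0])
cmd = np.array([1.0, 0.0])                    # 1 m/s straight ahead for 0.1 s
honest_vo = np.array([0.1, 0.0, 0.0])
spoofed_vo = np.array([0.6, 0.3, 0.0])
print(inconsistency_check(prev, cmd, honest_vo, dt=0.1))   # (False, 0.0)
print(inconsistency_check(prev, cmd, spoofed_vo, dt=0.1))  # (True, ~0.58)
```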
|
|
11:40-11:45, Paper WeCT11.6 | |
Distributed Perception Aware Safe Leader Follower System Via Control Barrier Methods |
|
Suganda, Richie Ryulie | University of Houston |
Tran, Tony | University of Houston |
Pan, Miao | University of Houston |
Fan, Lei | University of Houston |
Lin, Qin | University of Houston |
Hu, Bin | University of Houston |
Keywords: Robot Safety, Multi-Robot Systems, Vision-Based Navigation
Abstract: This paper addresses a distributed leader-follower formation control problem for a group of agents, each using a body-fixed camera with a limited field of view (FOV) for state estimation. The main challenge arises from the need to coordinate the agents’ movements with their cameras’ FOV to maintain visibility of the leader for accurate and reliable state estimation. To address this challenge, we propose a novel perception-aware distributed leader-follower safe control scheme that incorporates FOV limits as state constraints. A Control Barrier Function (CBF) based quadratic program is employed to ensure the forward invariance of a safety set defined by these constraints. Furthermore, new neural network based and double bounding boxes based estimators, combined with temporal filters, are developed to estimate system states directly from real-time image data, providing consistent performance across various environments. Comparison results in the Gazebo simulator demonstrate the effectiveness and robustness of the proposed framework in two distinct environments.
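As background for the CBF-based quadratic program mentioned above, a single affine CBF constraint admits a closed-form, minimally invasive correction of the nominal input. The sketch below illustrates this generic construction for control-affine dynamics; the dynamics, barrier function, and gain are toy assumptions, not the paper's leader-follower system.

```python
import numpy as np

def cbf_qp_single_constraint(u_nom, Lfh, Lgh, h, alpha=1.0):
    """Minimally modify u_nom so that  Lfh + Lgh @ u + alpha * h >= 0.

    For a single affine constraint, the QP  min ||u - u_nom||^2  s.t.  a @ u >= b
    has the closed-form solution  u = u_nom + max(0, b - a @ u_nom) * a / (a @ a).
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(Lgh, dtype=float)
    b = -(Lfh + alpha * h)
    slack = b - a @ u_nom
    if slack <= 0.0:                 # nominal input already satisfies the constraint
        return u_nom
    return u_nom + slack * a / (a @ a)

# Toy usage: keep a scalar state x >= 0 with integrator dynamics xdot = u,
# i.e. h(x) = x, Lfh = 0, Lgh = [1].
x = 0.2
u_nominal = np.array([-3.0])         # nominal controller pushes toward the boundary
u_safe = cbf_qp_single_constraint(u_nominal, Lfh=0.0, Lgh=np.array([1.0]), h=x)
print(u_safe)                        # [-0.2], i.e. clipped to -alpha * h
```

With several constraints active at once, for instance one per field-of-view limit, a small QP solver would take the place of this closed-form projection.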
|
|
WeCT12 |
315 |
Human-Robot Interaction 4 |
Regular Session |
|
11:15-11:20, Paper WeCT12.1 | |
Gesturing towards Efficient Robot Control: Exploring Sensor Placement and Control Modes for Mid-Air Human-Robot Interaction |
|
Mielke, Tonia | Otto-Von-Guericke University Magdeburg |
Heinrich, Florian | Otto-Von-Guericke University Magdeburg |
Hansen, Christian | Otto-Von-Guericke University Magdeburg |
Keywords: Design and Human Factors, Virtual Reality and Interfaces, Sensor-based Control
Abstract: While collaborative robots effectively combine robotic precision with human capabilities, traditional control methods such as button presses or hand guidance can be slow and physically demanding. This has led to an increasing interest in natural user interfaces that integrate hand gesture-based interactions for more intuitive and flexible robot control. Therefore, this paper systematically explores mid-air robot control by comparing position and rate control modes with different state-of-the-art and novel sensor placements. A user study was conducted to evaluate each combination in terms of accuracy, task duration, perceived workload, and physical exertion. Our results indicate that position control is more efficient than rate control. Traditional desk-mounted sensors can provide a good balance between accuracy and comfort. However, robot-mounted sensors are a viable alternative for short-term, accurate control with smaller spatial requirements. Leg-mounted sensors, while comfortable, pose challenges to hand-eye coordination. Based on these findings, we provide design implications for improving the usability and comfort of mid-air human-robot interaction. Future research should extend this evaluation to a wider range of tasks and environments.
|
|
11:20-11:25, Paper WeCT12.2 | |
Understanding Dynamic Human-Robot Proxemics in the Case of Four-Legged Canine-Inspired Robots |
|
Xu, Xiangmin | University of Glasgow |
Meng, Zhen | University of Glasgow |
Li, Liying Emma | University of Glasgow |
Khamis, Mohamed | University of Glasgow |
Zhao, Philip Guodong | University of Manchester, UK |
Bretin, Robin | University of Glasgow |
Keywords: Physical Human-Robot Interaction, Social HRI, Safety in HRI
Abstract: The integration of humanoid and animal-shaped robots into specialized domains, such as healthcare, multi-terrain operations, and psychotherapy, necessitates a deep understanding of proxemics—the study of spatial behavior that governs effective human-robot interactions. Unlike traditional robots in manufacturing or logistics, these robots must navigate complex human environments where maintaining appropriate physical and psychological distances is crucial for seamless interaction. This study explores the application of proxemics in human-robot interactions, focusing specifically on quadruped robots, which present unique challenges and opportunities due to their lifelike movement and form. Utilizing a motion capture system, we examine how different interaction postures of a canine robot influence human participants' proxemic behavior in dynamic scenarios. By capturing and analyzing position and orientation data, this research aims to identify key factors that affect proxemic distances and inform the design of socially acceptable robots. The findings underscore the importance of adhering to human psychological and physical distancing norms in robot design, ensuring that autonomous systems can coexist harmoniously with humans.
|
|
11:25-11:30, Paper WeCT12.3 | |
Autonomous Navigation in Crowded Space Using Multi-Sensory Data Fusion |
|
Ananna, Nourin Siddique | BRAC University |
Saif, Mollah Md | BRAC University |
Noor, Maisha | BRAC University |
Awishi, Ishrat Tasnim | BRAC University |
Rahman, Md. Khalilur | BRAC University |
Alam, Md Golam Rabilul | BRAC University |
Keywords: Human-Aware Motion Planning, Data Sets for Robot Learning, Sensor Fusion
Abstract: Autonomous navigation in crowded environments remains a significant challenge due to the highly dynamic and unpredictable nature of pedestrian movements. This paper presents a novel approach for socially-compliant crowd navigation by leveraging human pose tracking, trajectory prediction, and obstacle avoidance techniques. We introduce PoseTrajNet, an end-to-end autonomous agent navigation pipeline that integrates YOLOv8 for object detection, BlazePose for real-time human pose estimation, and a custom trajectory prediction model drawing on concepts from Social GANs. PoseTrajNet employs pose keypoints as socially-compliant features to anticipate pedestrian trajectories, enabling proactive path planning and dynamic safe radius adjustments for obstacle avoidance. Extensive evaluations on standard datasets demonstrate PoseTrajNet's effectiveness in seamless crowd navigation, outperforming baselines while adhering to social norms.
|
|
11:30-11:35, Paper WeCT12.4 | |
Feasibility-Aware Imitation Learning from Observation through a Hand-Mounted Demonstration Interface |
|
Takahashi, Kei | Nara Institute of Science and Technology |
Sasaki, Hikaru | Nara Institute of Science and Technology |
Matsubara, Takamitsu | Nara Institute of Science and Technology |
Keywords: Imitation Learning, Learning from Demonstration
Abstract: Imitation learning through a demonstration interface is expected to learn policies for robot automation from intuitive human demonstrations. However, due to the differences in human and robot movement characteristics, a human expert might unintentionally demonstrate an action that the robot cannot execute. We propose feasibility-aware behavior cloning from observation (FABCO). In the FABCO framework, the feasibility of each demonstration is assessed using the robot's pre-trained forward and inverse dynamics models. This feasibility information is provided as visual feedback to the demonstrators, encouraging them to refine their demonstrations. During policy learning, estimated feasibility serves as a weight for the demonstration data, improving both the data efficiency and the robustness of the learned policy. We experimentally validated FABCO's effectiveness by applying it to a pipette insertion task involving a pipette and a vial. Four participants assessed the impact of the feasibility feedback and the weighted policy learning in FABCO. Additionally, we used the NASA Task Load Index (NASA-TLX) to evaluate the workload induced by demonstrations with visual feedback.
|
|
11:35-11:40, Paper WeCT12.5 | |
Human-Robot Collaboration for the Remote Control of Mobile Humanoid Robots with Torso-Arm Coordination |
|
Boguslavskii, Nikita | Worcester Polytechnic Institute (WPI) |
Genua, Lorena Maria | Worcester Polytechnic Institute |
Li, Zhi | Worcester Polytechnic Institute |
Keywords: Telerobotics and Teleoperation, Human-Robot Collaboration, Human Factors and Human-in-the-Loop
Abstract: Humanoid robots are increasingly being deployed in various facilities, including hospitals and assisted living environments, where they are often remotely controlled by human operators. Their kinematic redundancy enhances reachability and manipulability, enabling them to navigate complex, cluttered environments and perform a wide range of tasks. However, this redundancy also presents significant control challenges, particularly in coordinating the movements of the robot's macro-micro structure (torso and arms). Therefore, we propose various human-robot collaborative (HRC) methods for coordinating the torso and arm of remotely controlled mobile humanoid robots, aiming to balance autonomy and human input to enhance system efficiency and task execution. The proposed methods include human-initiated approaches, where users manually control torso movements, and robot-initiated approaches, which autonomously coordinate torso and arm based on factors such as reachability, task goal, or inferred human intent. We conducted a user study with N=17 participants to compare the proposed approaches in terms of task performance, manipulability, and energy efficiency, and analyzed which methods were preferred by participants.
|
|
11:40-11:45, Paper WeCT12.6 | |
Soft Human-Robot Handover Using a Vision-Based Pipeline |
|
Castellani, Chiara | Istituto Italiano Di Tecnologia |
Turco, Enrico | Istituto Italiano Di Tecnologia |
Bo, Valerio | Istituto Italiano Di Tecnologia |
Malvezzi, Monica | University of Siena |
Prattichizzo, Domenico | University of Siena |
Costante, Gabriele | University of Perugia |
Pozzi, Maria | University of Siena |
Keywords: Grasping, Soft Robot Applications, Physical Human-Robot Interaction
Abstract: Handing over objects is an essential task in human-robot collaborative scenarios. Previous studies have predominantly employed rigid grippers to perform the handover, focusing their efforts on the generation of grasps that avoid physical contact with people. In this paper, instead, we present a vision-based open-palm handover solution where a soft robotic hand deliberately exploits contact with the human hand for improved grasp success and robustness. In particular, the human-robot physical interaction allows the robotic hand to slide over the human palm surface and firmly cage the object. The identification of the human hand plane and the object pose is achieved through a versatile perception pipeline that exploits a single RGB-D camera. Through several experimental trials we show that the system achieves successful grasps over multiple objects with different geometries and textures. We also conduct a comparative analysis between the proposed soft handover method and a baseline approach, evaluating their robustness to uncertainties on the object position. Lastly, a user study with 30 participants is conducted to evaluate the users' perception of the human-robot interaction during the handover. The obtained results highlight the effectiveness of the proposed pipeline with different users and an overall users' preference for the soft handover.
|
|
WeCT13 |
316 |
Soft Robotic Grasping 2 |
Regular Session |
Chair: Wang, Wei | University of Wisconsin-Madison |
Co-Chair: Vikas, Vishesh | University of Alabama |
|
11:15-11:20, Paper WeCT13.1 | |
Utilizing Bioinspired Soft Modular Appendages for Grasping and Locomotion in Multi-Legged Robots on Ground and Underwater |
|
Siddiquee, Abu Nayem Md. Asraf | University of Notre Dame |
Ozkan-Aydin, Yasemin | University of Notre Dame |
Keywords: Soft Robot Applications, Biologically-Inspired Robots, Soft Sensors and Actuators
Abstract: Soft robots can adapt to their environments, which makes them suitable for deployment in disaster areas and agricultural fields, where their mobility is constrained by complex terrain. One of the main challenges in developing soft terrestrial robots is that the robot must be soft enough to adapt to its environment, but also rigid enough to exert adequate force on the ground to locomote. In this letter, we report a pneumatically driven, soft modular appendage made of silicone for a terrestrial robot capable of generating specific mechanical movement to locomote in the desired direction. We used Finite Element Analysis (FEA) simulations to assess the soft leg's bending behavior, validated against the physical leg. In addition, we performed blocked force analysis to understand its force generation capabilities. We developed a soft-rigid-bodied tethered robot prototype and tested it in ground and underwater environments to evaluate its locomotion performance. The robot demonstrated successful forward and backward movement as well as left and right turns, both on the ground and underwater. We explored the object manipulation and transportation capability of the robot by adding two additional soft appendages as a gripper. The robot demonstrated its ability to effectively manipulate and transport objects of varying nature, including rigid items such as a 3D-printed plastic box and fragile objects like an egg. The maximum load-carrying capacity of the robot was also investigated both on the ground and in water. Our design approach provides a straightforward, cost-effective, and efficient method for creating versatile soft appendages for a robot that is capable of terradynamic locomotion. This approach showcases its potential applicability in underwater search and rescue missions.
|
|
11:20-11:25, Paper WeCT13.2 | |
Design of a Novel Pneumatic Soft Gripper for Robust Adaptive Grasping |
|
Sun, Xiantao | Anhui University |
Zhong, Mingsheng | Anhui University |
Tang, Zhouzheng | Anhui University |
Chen, Wenjie | Anhui University |
Chen, Weihai | Beihang University |
Keywords: Grippers and Other End-Effectors, Mechanism Design, Soft Sensors and Actuators
Abstract: Soft grippers have shown promising performance in safe and adaptive grasping tasks. However, they often suffer from limitations in grasping force. To address this challenge, this paper presents a novel pneumatic three-finger soft gripper to achieve robust adaptive grasping. The gripper consists of three identical fingers, each containing a pneumatic bending soft actuator and a pneumatic lateral soft actuator. The bending actuator features a tilted pneumatic network structure, which provides superior bending performance compared to traditional vertical pneumatic network structures. The lateral actuator is equipped with three deflection chambers at the finger root to mimic the lateral motions of a human finger. Kinematic and static models are established to predict the bending angle and grasping force of the soft finger under pressurized air. The performance of the proposed soft finger is analyzed through finite element simulations, and the effect of the chamber tilt angle is also examined. The theoretical and simulation results are compared to verify the validity of the analytical models. Finally, the proposed soft gripper is fabricated by 3D printing and molding. Experimental results show that the gripper is capable of grasping various objects of different sizes, shapes, materials, and weights, and can perform dexterous manipulation tasks, such as cap unscrewing. The proposed soft gripper exhibits significant potential for applications in robust robotic grasping tasks.
|
|
11:25-11:30, Paper WeCT13.3 | |
Hybrid Gripper with Passive Pneumatic Soft Joints for Grasping Deformable Thin Objects |
|
Tran, Duy | Ha Noi University of Science and Technology |
Ly, Hoang Hiep | Hanoi University of Science and Technology |
Nguyen, Thuan | Hanoi University of Science and Technology |
Mac, Thi Thoa | HUST |
Nguyen, Anh | University of Liverpool |
Ta, Tung D. | The University of Tokyo |
Keywords: Mechanism Design, Grippers and Other End-Effectors, Soft Robot Materials and Design
Abstract: Grasping a variety of objects remains a key challenge in the development of versatile robotic systems. The human hand is remarkably dexterous, capable of grasping and manipulating objects with diverse shapes, mechanical properties, and textures. Inspired by how humans use two fingers to pick up thin and large objects such as fabric or sheets of paper, we aim to develop a gripper optimized for grasping such deformable objects. Observing how the soft and flexible fingertip joints of the hand approach and grasp thin materials, we propose a hybrid gripper design that incorporates both soft and rigid components. The gripper utilizes a soft pneumatic ring wrapped around a rigid revolute joint to create a flexible two-fingered gripper. Experiments were conducted to characterize and evaluate the gripper's performance in handling sheets of paper and other objects. Compared to rigid grippers, the proposed design improves grasping efficiency and reduces the gripping distance by up to a factor of eight.
|
|
11:30-11:35, Paper WeCT13.4 | |
Dexterous Three-Finger Gripper Based on Offset Trimmed Helicoids |
|
Guan, Qinghua | Harbin Institute of Technology |
Cheng, Hung Hon | EPFL |
Hughes, Josie | EPFL |
Keywords: Grippers and Other End-Effectors, Soft Sensors and Actuators, Soft Robot Applications
Abstract: This study presents an innovative offset-trimmed helicoids (OTH) structure, featuring a tunable deformation center that emulates the flexibility of human fingers. This design significantly reduces the actuation force needed for larger elastic deformations, particularly when dealing with harder materials like thermoplastic polyurethane (TPU). The incorporation of two helically routed tendons within the finger enables both in-plane bending and lateral out-of-plane transitions, effectively expanding its workspace and allowing for variable curvature along its length. Compliance analysis indicates that the compliance at the fingertip can be fine-tuned by adjusting the mounting placement of the fingers. This customization enhances the gripper's adaptability to a diverse range of objects. By leveraging TPU's substantial elastic energy storage capacity, the gripper is capable of dynamically rotating objects at high speeds, achieving approximately 60° in just 15 milliseconds. The three-finger gripper, with its high dexterity across six degrees of freedom, has demonstrated the capability to successfully perform intricate tasks. One such example is the adept spinning of a rod within the gripper's grasp.
|
|
11:35-11:40, Paper WeCT13.5 | |
Improving Grip Stability Using Passive Compliant Microspine Arrays for Soft Robots in Unstructured Terrain |
|
Ervin, Lauren | University of Alabama |
Bezawada, Harish | The University of Alabama |
Vikas, Vishesh | University of Alabama |
Keywords: Compliant Joints and Mechanisms, Soft Robot Materials and Design, Field Robots
Abstract: Microspines are small spines, commonly found on insect legs, that reinforce surface interaction by engaging with asperities to increase shear force and traction. An array of such microspines, when integrated into the limbs or undercarriage of a robot, can provide the ability to maneuver uneven terrains, traverse inclines, and even climb walls. Meanwhile, the conformability and adaptability of soft robots make them ideal candidates for applications involving traversal of complex, unstructured terrains. However, there remains a real-life realization gap for soft locomotors pertaining to their transition from the controlled lab environment to the field, which can be bridged by improving grip stability through effective integration of microspines. In this research, a passive, compliant microspine stacked array design is proposed to enhance the locomotion capabilities of mobile soft robots. A microspine array integration method effectively addresses the stiffness mismatch between soft, compliant, and rigid components. Additionally, a reduction in complexity results from actuation of the surface-conformable soft limb using a single actuator. The two-row, stacked microspine array configuration offers improved gripping capabilities on steep and irregular surfaces. This design is incorporated into three different robot configurations - the baseline without microspines and two others with different combinations of microspine arrays. Field experiments are conducted on surfaces of varying surface roughness and non-uniformity - concrete, brick, compact sand, and tree roots. Experimental results demonstrate that the inclusion of microspine arrays increases planar displacement by an average of 10 times. The improved grip stability, repeatability, and terrain traversability are reflected in a decrease in the relative standard deviation of the locomotion gaits.
|
|
11:40-11:45, Paper WeCT13.6 | |
Hybrid Soft Pneumatic and Tendon Actuated Finger with Selective Locking Chain Link Joints |
|
Lin, Keng-Yu | University of Wisconsin Madison |
Stonecipher, Jack | University of Wisconsin-Madison |
Rusch, Zach | University of Wisconsin-Madison |
Wang, Wei | University of Wisconsin-Madison |
Wehner, Michael | University of Wisconsin, Madison |
Keywords: Soft Robot Applications, Grippers and Other End-Effectors, Grasping
Abstract: Rigid robots excel in structured conditions, but struggle in more unpredictable or populated environments. Soft robots address these difficulties, but the compliance which gives them their inherent safety also limits their ability to apply desired forces. Jamming/locking reduces this back-drivability but does not allow for directional application of force. We present a hybrid system including pneumatic and tendon actuation as well as a system of cable-driven locking modules, able to lock individual joints. Combining these mechanisms yields a device which can behave as: a soft finger, a fully-rigid finger, and a locally-locking finger which mimics a traditional rigid-link robot. This finger is able to switch between these behaviors on-the-fly, allowing it to adapt to unexpected scenarios, critical for social robots. Using these modes and the ability to adapt in real time, our finger is able to complete common household tasks that are difficult for current robots. We characterize the finger's ability to resist force in three actuation modes (Pneumatic, Cable, and Locked), its ability to apply force, and its ability to actuate in 31 different configurations (plus a static all-locked configuration). We also present demonstrations in which the finger conforms to the shape of a computer mouse and then clicks a mouse button, and conforms to the shape of a heavy door handle and then pulls it open. We present the design, fabrication, and characterization of this finger as a demonstration of the underlying concept, which can be broadly applied to social robotics.
|
|
WeCT14 |
402 |
Reconfigurable Robots |
Regular Session |
|
11:15-11:20, Paper WeCT14.1 | |
Enabling Framework for Constant Complexity Model in Autonomous Inter-Reconfigurable Robots (I) |
|
Wan, Ash Yaw Sang | Singapore University of Technology and Design |
Le, Anh Vu | Communication and Signal Processing Research Group Faculty of El |
Moo, Chee Gen | Singapore University of Technology and Design |
Sivanantham, Vinu | Singapore University of Technology and Design |
Elara, Mohan Rajesh | Singapore University of Technology and Design |
Keywords: Cellular and Modular Robots, Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: In reconfigurable robotics, intra-reconfiguration enables a robot to change its functional abilities, while inter-reconfiguration manipulates the specification limits of the robot hardware. Although the versatility of inter-reconfigurable robots is desirable in advanced autonomous systems, the O(n^3) computational time complexity of existing algorithms becomes a challenge when multiple modular robots combine and reconfigure into a larger structure for autonomous navigation tasks. This has limited the potential of inter-reconfiguration in terms of expansion, versatility, and robustness. In this paper, a navigation framework with non-complex transformation states is proposed for inter-reconfigurable robots performing combining and splitting operations. Simulations show a reduction in the complexity of the framework's reconfiguration states from O(n) to constant time O(1) over a considerable number of robot agents. Additionally, a set of inter-reconfigurable robots, Wasp Biggie, was used as a fully functional centralized planner system to demonstrate the proof of concept in experiments. These experiments showed consistently favorable CPU consumption while performing navigation and reconfiguration.
|
|
11:20-11:25, Paper WeCT14.2 | |
Improving Coverage Performance of a Size-Reconfigurable Robot Based on Overlapping and Reconfiguration Reduction Criteria |
|
Muthugala Arachchige, Viraj Jagathpriya Muthugala | Singapore University of Technology and Design |
Samarakoon Mudiyanselage, Bhagya Prasangi Samarakoon | Singapore University of Technology and Design |
Wijegunawardana, Isira Damsith | Singapore University of Technology |
Elara, Mohan Rajesh | Singapore University of Technology and Design |
Keywords: Motion and Path Planning, Planning under Uncertainty, Neural and Fuzzy Control
Abstract: Size reconfigurable robots have been introduced for coverage applications to improve performance. The size reconfiguration ability allows a robot to access narrow areas in a smaller size while covering open spaces in a larger size, improving productivity. This paper proposes a novel coverage path planning (CPP) method consisting of an Overlapping Reduction Criterion (ORC) and a Reconfiguration Reduction Criterion (RRC) for a size-reconfigurable robot to improve performance in dynamic workspaces. A Glasius Bio-inspired Neural Network (GBNN) is adapted to guide the robot toward unvisited cells considering neural activity variation. The size variation is managed by utilizing a collection of grid maps generated for various size configurations of the robot. In the decision-making process of next-movement selection, the RRC and ORC penalize movements that require size reconfigurations or create isolated unvisited regions, thereby reducing reconfigurations and overlap. According to the results, the proposed CPP method surpasses the state of the art by significant margins in terms of the performance indexes of reconfiguration count, overlap, path distance, and coverage time.
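To illustrate the general idea of penalizing reconfigurations and overlap during next-cell selection, here is a hedged sketch only; the scoring terms, weights, and data layout are assumptions, not the paper's GBNN formulation.

```python
def select_next_cell(candidates, current_size, w_rrc=0.5, w_orc=0.3):
    """Pick the next cell by its neural activity, penalizing moves that require
    a size reconfiguration (RRC) or would leave isolated unvisited cells (ORC).

    candidates: list of dicts with keys 'id', 'activity' (neural activity of the
    cell), 'required_size', and 'isolates_region' (bool).
    """
    best_id, best_score = None, float('-inf')
    for c in candidates:
        score = c['activity']
        if c['required_size'] != current_size:
            score -= w_rrc                      # reconfiguration penalty (RRC)
        if c['isolates_region']:
            score -= w_orc                      # isolation/overlap penalty (ORC)
        if score > best_score:
            best_id, best_score = c['id'], score
    return best_id

cands = [
    {'id': 'A', 'activity': 0.9, 'required_size': 'small', 'isolates_region': False},
    {'id': 'B', 'activity': 0.8, 'required_size': 'large', 'isolates_region': False},
    {'id': 'C', 'activity': 0.4, 'required_size': 'large', 'isolates_region': True},
]
print(select_next_cell(cands, current_size='large'))    # 'B'
```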
|
|
11:25-11:30, Paper WeCT14.3 | |
CoCube: A Tabletop Modular Multi-Robot Platform for Education and Research |
|
Liang, Shuai | Fudan University |
Zhu, Songyi | Shanghai Artificial Intelligence Laboratory |
Tang, Zhonghan | University of Science and Technology of China |
Li, Chenhui | Shanghai Artificial Intelligence Laboratory |
Wu, Wenjie | DynaLab |
Han, Jialing | Fudan University |
Lin, Zemin | Shanghai Jiaotong University |
You, Zhongrui | Shanghai Artificial Intelligence Laboratory |
Maloney, John | MicroBlocks |
Romagosa Carrasquer, Bernat | SAP |
Zhao, Bin | Northwestern Polytechnical University |
Wang, Zhigang | Shanghai AI Laboratory |
Zhang, Zhinan | Shanghai Jiao Tong University |
Li, Xuelong | Northwestern Polytechnical University |
Keywords: Multi-Robot Systems, Education Robotics, Cellular and Modular Robots
Abstract: This paper presents CoCube, a tabletop modular robotics platform designed for robotics education and multi-robot algorithm research. CoCube is characterized by its low cost, low floors, high ceilings and wide walls, offering flexibility and broad applicability across various use cases. The platform comprises four key components: CoCube robots, which integrate wireless communication, movement and interaction; CoModules, which provide versatile external functionality; CoMaps, which enable high-precision localization via microdot patterns on regular printed paper; and CoTags for interaction. CoCube operates on MicroBlocks, a blocks programming language for physical computing inspired by Scratch, a widely-used coding language with a simple visual interface that makes programming accessible to young learners. It offers users both flexibility and ease of use, with advanced API support for more complex applications. This paper details the design of the CoCube platform and demonstrates its potential in both educational and research contexts.
|
|
11:30-11:35, Paper WeCT14.4 | |
Loopy Movements: Emergence of Rotation in a Multicellular Robot |
|
Smith, Trevor | West Virginia University |
Gu, Yu | West Virginia University |
Keywords: Cellular and Modular Robots, Swarm Robotics, Biologically-Inspired Robots
Abstract: Unlike most human-engineered systems, many biological systems rely on emergent behaviors from low-level interactions, enabling greater diversity and superior adaptation to complex, dynamic environments. This study explores emergent decentralized rotation in the Loopy multicellular robot, composed of homogeneous, physically linked, 1-degree-of-freedom cells. Inspired by biological systems like sunflowers, Loopy uses simple local interactions—diffusion, reaction, and active transport of simulated chemicals, called morphogens—without centralized control or knowledge of its global morphology. Through these interactions, the robot self-organizes to achieve coordinated rotational motion and forms lobes—local protrusions created by clusters of motor cells. This study investigates how these interactions drive Loopy’s rotation, the impact of its morphology, and its resilience to actuator failures. Our findings reveal two distinct behaviors: 1) inner valleys between lobes rotate faster than the outer peaks, contrasting with rigid body dynamics, and 2) cells rotate in the opposite direction of the overall morphology. The experiments show that while Loopy’s morphology does not affect its angular velocity relative to its cells, larger lobes increase cellular rotation and decrease morphology rotation relative to the environment. Even with up to one-third of its actuators disabled and significant morphological changes, Loopy maintains its rotational abilities, highlighting the potential of decentralized, bio-inspired strategies for resilient and adaptable robotic systems.
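The decentralized mechanism of diffusion, reaction, and active transport of a morphogen around a ring of cells can be caricatured with a few lines of purely local updates. This is an illustrative sketch under assumed coefficients and a simple linear reaction; Loopy's actual rules, lobed patterns, and rotation behavior come from the specific model in the paper.

```python
import numpy as np

def morphogen_step(u, D=0.2, transport=0.1, feed=0.01, decay=0.05):
    """One explicit update of a single morphogen on a ring of cells.

    Each cell only talks to its two neighbors: diffusion (periodic discrete
    Laplacian), directional active transport around the ring, and a local
    reaction (constant production with linear decay). No cell knows the
    global morphology.
    """
    diffusion = np.roll(u, 1) + np.roll(u, -1) - 2.0 * u
    advection = np.roll(u, 1) - u        # material handed on from one neighbor
    reaction = feed - decay * u
    return u + D * diffusion + transport * advection + reaction

rng = np.random.default_rng(0)
u = rng.random(36)                       # 36 cells, random initial concentration
for _ in range(300):
    u = morphogen_step(u)
print(u.round(3))                        # concentration profile after 300 local updates
```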
|
|
11:35-11:40, Paper WeCT14.5 | |
Enhancing Connection Strength in Freeform Modular Reconfigurable Robots through Holey Sphere and Gripper Mechanisms |
|
Wang, Peiqi | The Chinese University of Hong Kong, Shenzhen |
Liang, Guanqi | The Chinese University of Hong Kong, Shenzhen |
Zhao, Da | The Chinese University of Hong Kong |
Lam, Tin Lun | The Chinese University of Hong Kong, Shenzhen |
Keywords: Cellular and Modular Robots, Mechanism Design, Distributed Robot Systems
Abstract: Freeform modular self-reconfigurable robot (MSRR) systems overcome traditional docking limitations, enabling rapid and continuous connections between modules in any direction. Recent advancements in freeform MSRR technology have significantly enhanced connectivity and mobility. However, limitations in connector strength and operational efficiency in existing designs restrict performance. This paper proposes a rigid freeform connector and a rigid magnetic track design to improve the connection and motion performance of the SnailBot. Each SnailBot is equipped with a multi-channel rope-driven gripper, a metal spherical shell with densely distributed circular holes on the back, and a rigid chain design conforming to the spherical surface. This combination allows each SnailBot to move precisely along the surface of a peer, facilitated by the ferromagnetic spherical shell and magnetic track. The integration of the gripper and spherical shell hole array provides robust inter-module connections in any position and orientation. The effectiveness of these designs has been validated through a series of experiments and analyses, demonstrating improved connection and motion performance in the SnailBot dual-mode connector system and expanding its potential applications and functional capabilities.
|
|
WeCT15 |
403 |
Bimanual Manipulation 2 |
Regular Session |
Chair: Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Co-Chair: Gupta, Satyandra K. | University of Southern California |
|
11:15-11:20, Paper WeCT15.1 | |
Learning Dexterous Bimanual Catch Skills through Adversarial-Cooperative Heterogeneous-Agent Reinforcement Learning |
|
Kim, Taewoo | Electronics and Telecommunications Research Institute |
Yoon, Youngwoo | Electronics and Telecommunications Research Institute |
Kim, Jaehong | ETRI |
Keywords: Bimanual Manipulation, Reinforcement Learning, Multifingered Hands
Abstract: Robotic catching has traditionally focused on single-handed systems, which are limited in their ability to handle larger or more complex objects. In contrast, bimanual catching offers significant potential for improved dexterity and object handling but introduces new challenges in coordination and control. In this paper, we propose a novel framework for learning dexterous bimanual catching skills using Heterogeneous-Agent Reinforcement Learning (HARL). Our approach introduces an adversarial reward scheme, where a throw agent increases the difficulty of throws by adjusting throwing speed, while a catch agent learns to coordinate both hands to catch objects under these evolving conditions. We evaluate the framework in simulated environments using 15 different objects, demonstrating robustness and versatility in handling diverse objects. Our method achieved approximately a 2x increase in catching reward compared to single-agent baselines across 15 diverse objects.
|
|
11:20-11:25, Paper WeCT15.2 | |
Flat'n'Fold: A Diverse Multi-Modal Dataset for Garment Perception and Manipulation |
|
Zhuang, Lipeng | University of Glasgow |
Fan, Shiyu | University of Glasgow |
Ru, Yingdong | University of Glasgow |
Audonnet, Florent | University of Glasgow |
Henderson, Paul | University of Glasgow |
Aragon-Camarasa, Gerardo | University of Glasgow |
Keywords: Data Sets for Robotic Vision, Data Sets for Robot Learning, Bimanual Manipulation
Abstract: We present Flat'n'Fold, a novel large-scale dataset for garment manipulation that addresses critical gaps in existing datasets. Comprising 1,212 human and 887 robot demonstrations of flattening and folding 44 unique garments across 8 categories, Flat'n'Fold surpasses prior datasets in size, scope, and diversity. Our dataset uniquely captures the entire manipulation process from crumpled to folded states, providing synchronized multi-view RGB-D images, point clouds, and action data, including hand or gripper positions and rotations. We quantify the dataset's diversity and complexity compared to existing benchmarks and show that our dataset features natural and diverse real-world human and robot demonstrations in terms of visual and action information. To showcase Flat'n'Fold's utility, we establish new benchmarks for grasping point prediction and subtask decomposition. Our evaluation of state-of-the-art models on these tasks reveals significant room for improvement. This underscores Flat'n'Fold's potential to drive advances in robotic perception and manipulation of deformable objects. Our dataset can be downloaded at https://cvas-ug.github.io/flat-n-fold
|
|
11:25-11:30, Paper WeCT15.3 | |
TWIN: Two-Handed Intelligent Benchmark for Bimanual Manipulation |
|
Grotz, Markus | University of Washington (UW) |
Shridhar, Mohit | University of Washington |
Chao, Yu-Wei | NVIDIA |
Asfour, Tamim | Karlsruhe Institute of Technology (KIT) |
Fox, Dieter | University of Washington |
Keywords: Bimanual Manipulation, Software Tools for Benchmarking and Reproducibility, Imitation Learning
Abstract: Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by presenting a benchmark for bimanual manipulation. A key functionality is the ability to autonomously generate training data without the need for human demonstrations to the robot. We open-source our code and benchmark, which comprises 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To initiate the benchmark, we extended multiple state-of-the-art techniques to the domain of bimanual manipulation. The project website with code is available at: http://bimanual.github.io.
|
|
11:30-11:35, Paper WeCT15.4 | |
Active Vision Might Be All You Need: Exploring Active Vision in Bimanual Robotic Manipulation |
|
Chuang, Ian | University of California, Davis |
Lee, Andrew | University of California, Davis |
Gao, Dechen | University of California, Davis |
Naddaf Shargh, Mohammad Mahdi | University of California - Davis |
Soltani, Iman | University of California, Davis |
Keywords: Perception for Grasping and Manipulation, Dual Arm Manipulation, Dexterous Manipulation
Abstract: Imitation learning has demonstrated significant potential in performing high-precision manipulation tasks using visual feedback. However, it is common practice in imitation learning for cameras to be fixed in place, resulting in issues like occlusion and limited field of view. Furthermore, cameras are often placed in broad, general locations, without an effective viewpoint specific to the robot's task. In this work, we investigate the utility of active vision (AV) for imitation learning and manipulation, in which, in addition to the manipulation policy, the robot learns an AV policy from human demonstrations to dynamically change the robot's camera viewpoint to obtain better information about its environment and the given task. We introduce AV-ALOHA, a new bimanual teleoperation robot system with AV, an extension of the ALOHA 2 robot system, incorporating an additional 7-DoF robot arm that only carries a stereo camera and is solely tasked with finding the best viewpoint. This camera streams stereo video to an operator wearing a virtual reality (VR) headset, allowing the operator to control the camera pose using head and body movements. The system provides an immersive teleoperation experience, with bimanual first-person control, enabling the operator to dynamically explore and search the scene and simultaneously interact with the environment. We conduct imitation learning experiments of our system both in real-world and in simulation, across a variety of tasks that emphasize viewpoint planning. Our results demonstrate the effectiveness of human-guided AV for imitation learning, showing significant improvements over fixed cameras in tasks with limited visibility. Project website: https://soltanilara.github.io/av-aloha/
|
|
11:35-11:40, Paper WeCT15.5 | |
Force-Conditioned Diffusion Policies for Compliant Sheet Separation Tasks in Bimanual Robotic Cells |
|
Shukla, Rishabh | University of Southern California |
Talan, Raj | University of Southern California |
Moode, Samrudh | University of Southern California |
Dhanaraj, Neel | University of Southern California |
Kang, Jeon Ho | University of Southern California |
Gupta, Satyandra K. | University of Southern California |
Keywords: Learning from Demonstration, Bimanual Manipulation, Disassembly
Abstract: Disassembly is a critical challenge in maintenance and service tasks, particularly in high-precision operations such as electric vehicle (EV) battery recycling. Tasks like prying open sealed battery covers require precise manipulation and controlled force application. In our approach, we collect human demonstrations using a motion capture system, enabling the robot to learn from human-expert disassembly strategies. These demonstrations train a bimanual robotic system in which one arm exerts force with a specialized tool while the other manipulates and removes sealed components. Our method builds on a diffusion-based policy and integrates real-time force sensing to adapt its actions as contact conditions change. We decompose the demonstrations into distinct sub-tasks and apply data augmentation, thereby reducing the number of demonstrations needed and mitigating potential task failures. Our results show that the proposed method, even with a small dataset, achieves a high task success rate and efficiency compared to a standard diffusion technique. We demonstrate in a real-world application that the bimanual system effectively executes chiseling and peeling actions to separate a bonded sheet from a substrate.
|
|
11:40-11:45, Paper WeCT15.6 | |
A Comparison of Imitation Learning Algorithms for Bimanual Manipulation |
|
Drolet, Michael | Technische Universität Darmstadt |
Stepputtis, Simon | Carnegie Mellon University |
Kailas, Siva | Carnegie Mellon University |
Jain, Ajinkya | Intrinsic Innovation LLC |
Peters, Jan | Technische Universität Darmstadt |
Schaal, Stefan | Google X |
Ben Amor, Heni | Arizona State University |
Keywords: Imitation Learning, Bimanual Manipulation, Learning from Demonstration
Abstract: Amidst the wide popularity of imitation learning algorithms in robotics, their properties regarding hyperparameter sensitivity, ease of training, data efficiency, and performance have not been well-studied in high-precision industry-inspired environments. In this work, we demonstrate the limitations and benefits of prominent imitation learning approaches and analyze their capabilities regarding these properties. We evaluate each algorithm on a complex bimanual manipulation task involving an over-constrained dynamics system in a setting involving multiple contacts between the manipulated object and the environment. While we find that imitation learning is well suited to solve such complex tasks, not all algorithms are equal in terms of handling environmental and hyperparameter perturbations, training requirements, performance, and ease of use. We investigate the empirical influence of these key characteristics by employing a carefully designed experimental procedure and learning environment.
|
|
WeCT16 |
404 |
Grasping 2 |
Regular Session |
Chair: Jia, Yan-Bin | Iowa State University |
Co-Chair: Spenko, Matthew | Illinois Institute of Technology |
|
11:15-11:20, Paper WeCT16.1 | |
Trajectory Optimization for Dynamically Grasping Irregular Objects |
|
Vu, Minh Nhat | TU Wien, Austria |
Grander, Florian | EGGER Holzwerkstoffe Brilon GmbH |
Nguyen, Anh | University of Liverpool |
Unger, Christoph | TU Wien |
Keywords: Industrial Robots, Motion and Path Planning
Abstract: This paper presents a novel trajectory optimization framework for grasping a thin object with the Schunk SDH2 hand mounted on a KUKA robot. Unlike a conventional grasping task, we aim to achieve a "dynamic grasp" of the object, which requires continuous movement during the grasping process. The trajectory framework comprises two phases. Firstly, within a specified time limit of 10 seconds, initial offline trajectories are computed for a seamless motion from an initial configuration of the robot to grasp the object and deliver it to a pre-defined target location. Secondly, fast online trajectory optimization is implemented to update robot trajectories in real time within 100 milliseconds. This helps to mitigate pose estimation errors from the vision system. To account for model inaccuracies, disturbances, and other non-modeled effects, trajectory tracking controllers for both the robot and the gripper are implemented to execute the optimal trajectories from the proposed framework. Simulation and experimental results effectively demonstrate the performance of the trajectory planning framework in real-world scenarios.
|
|
11:20-11:25, Paper WeCT16.2 | |
DistillGrasp: Integrating Features Correlation with Knowledge Distillation for Depth Completion of Transparent Objects |
|
Huang, Yiheng | Guangdong University of Technology |
Chen, Junhong | Guangdong University of Technology |
Michiels, Nick | Hasselt University - Flanders Make - Expertise Centre for Digita |
Asim, Muhammad | Guangdong University of Technology |
Claesen, Luc | Hasselt University |
Liu, Wenyin | Guangdong University of Technology |
Keywords: Deep Learning in Grasping and Manipulation, Perception for Grasping and Manipulation, Grasping
Abstract: Due to the visual properties of reflection and refraction, RGB-D cameras cannot accurately capture the depth of transparent objects, leading to incomplete depth maps. To fill in the missing points, recent studies tend to explore new visual features and design complex networks to reconstruct the depth; however, these approaches tremendously increase computation, and the correlation of different visual features remains a problem. To this end, we propose an efficient depth completion network named DistillGrasp, which distills knowledge from the teacher branch to the student branch. Specifically, in the teacher branch, we design a position correlation block (PCB) that leverages RGB images as the query and key to search for the corresponding values, guiding the model to establish correct correspondence between two features and transfer it to the transparent areas. For the student branch, we propose a consistent feature correlation module (CFCM) that retains the reliable regions of RGB images and depth maps respectively according to the consistency and adopts a CNN to capture the pairwise relationship for depth completion. To avoid the student branch only learning regional features from the teacher branch, we devise a distillation loss that not only considers the distance loss but also the object structure and edge information. Extensive experiments conducted on the ClearGrasp dataset show that our teacher network outperforms state-of-the-art methods in terms of accuracy and generalization, and the student network achieves competitive results with a higher speed of 48 FPS. In addition, the significant improvement in a real-world robotic grasping system illustrates the effectiveness and robustness of our proposed system.
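As a rough illustration of a distillation-style objective that mixes a distance term with an edge/structure term (in the spirit of, but not identical to, the loss described above; the weights, gradient operator, and synthetic data are assumptions):

```python
import numpy as np

def image_gradients(d):
    """Forward-difference gradients of a depth map of shape (H, W)."""
    gx = np.diff(d, axis=1, append=d[:, -1:])
    gy = np.diff(d, axis=0, append=d[-1:, :])
    return gx, gy

def distillation_loss(student, teacher, w_dist=1.0, w_edge=0.5):
    """Distance term plus an edge/structure term between student and teacher depth."""
    dist = np.mean(np.abs(student - teacher))
    sgx, sgy = image_gradients(student)
    tgx, tgy = image_gradients(teacher)
    edge = np.mean(np.abs(sgx - tgx)) + np.mean(np.abs(sgy - tgy))
    return w_dist * dist + w_edge * edge

teacher = np.tile(np.linspace(0.5, 1.5, 64), (48, 1))    # smooth synthetic depth map
student = teacher + 0.02 * np.random.default_rng(1).standard_normal((48, 64))
print(distillation_loss(student, teacher))
```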
|
|
11:25-11:30, Paper WeCT16.3 | |
Real-Time Grasp Quality in Boundary-Constrained Granular Swarm Robots |
|
Mulroy, Declan | Illinois Institute of Technology |
Cañones Bonham, David Francesc | Illinois Institute of Technology |
Spenko, Matthew | Illinois Institute of Technology |
Srivastava, Ankit | Illinois Institute of Technology |
Keywords: Grasping, Swarm Robotics, Motion Control
Abstract: Soft robotic grippers offer advantages over rigid end effectors but are typically coupled to a rigid robot for locomotion. In contrast, this paper details a soft robot for both locomotion and grasping. The system is a type of boundary-constrained granular swarm robot, which is composed of a closed-loop series of active (capable of locomotion) sub-robots. Prior work has shown how this type of robot is capable of locomotion and grasping. For this paper, we propose a new grasping strategy and demonstrate real-time grasp quality evaluation using pressure sensors and the Ferrari-Canny grasp metric. The grasping strategy leverages gradient-based control via distance functions and dynamic system planning to achieve desired robot geometries for effective grasping. Previous research primarily used pull tests to evaluate grasping efficacy, which lacked real-time feedback on grasp quality. Simulated and experimental results confirm the effectiveness of this method.
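The Ferrari-Canny (epsilon) quality used for real-time grasp evaluation measures how large a wrench-space ball around the origin fits inside the convex hull of the contact wrenches. The sketch below computes this generic metric for planar point contacts with discretized friction cones; it is a textbook-style illustration, not the robot's pressure-sensor pipeline.

```python
import numpy as np
from scipy.spatial import ConvexHull

def planar_contact_wrenches(points, normals, mu=0.5):
    """Planar point contacts with Coulomb friction: each contact contributes two
    friction-cone edge forces; a wrench is (fx, fy, torque about the origin)."""
    wrenches = []
    for p, n in zip(points, normals):
        n = np.asarray(n, dtype=float) / np.linalg.norm(n)
        t = np.array([-n[1], n[0]])               # in-plane tangent
        for f in (n + mu * t, n - mu * t):        # friction cone edges
            tau = p[0] * f[1] - p[1] * f[0]       # 2D cross product
            wrenches.append([f[0], f[1], tau])
    return np.array(wrenches)

def epsilon_quality(wrenches):
    """Radius of the largest wrench-space ball around the origin contained in
    the convex hull of the contact wrenches (0 if the origin lies outside)."""
    hull = ConvexHull(wrenches)
    offsets = -hull.equations[:, -1]              # facets satisfy A @ w + b <= 0
    return max(0.0, float(offsets.min()))

# Antipodal planar grasp of a unit disc: expected quality 1/3 for mu = 0.5.
pts = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
nrm = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
print(epsilon_quality(planar_contact_wrenches(pts, nrm)))
```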
|
|
11:30-11:35, Paper WeCT16.4 | |
Learning Dual-Arm Coordination for Grasping Large Flat Objects |
|
Wang, Yongliang | University of Groningen |
Kasaei, Hamidreza | University of Groningen |
Keywords: Dexterous Manipulation, Bimanual Manipulation, Dual Arm Manipulation
Abstract: Grasping large flat objects, such as books or keyboards lying horizontally, presents significant challenges for single-arm robotic systems, often requiring extra actions like pushing objects against walls or moving them to the edge of a surface to facilitate grasping. In contrast, dual-arm manipulation, inspired by human dexterity, offers a more refined solution by directly coordinating both arms to lift and grasp the object without the need for complex repositioning. In this paper, we propose a model-free deep reinforcement learning (DRL) framework to enable dual-arm coordination for grasping large flat objects. We utilize a large scale grasp pose detection model as a backbone to extract high-dimensional features from input images, which are then used as the state representation in a reinforcement learning (RL) model. A CNN-based Proximal Policy Optimization (PPO) algorithm with shared Actor-Critic layers is employed to learn coordinated dual-arm grasp actions. The system is trained and tested in Isaac Gym and deployed to real robots. Experimental results demonstrate that our policy can effectively grasp large flat objects without requiring additional maneuvers. Furthermore, the policy exhibits strong generalization capabilities, successfully handling unseen objects. Importantly, it can be directly transferred to real robots without fine-tuning, consistently outperforming baseline methods.
|
|
11:35-11:40, Paper WeCT16.5 | |
QDGset: A Large Scale Grasping Dataset Generated with Quality-Diversity |
|
Huber, Johann | ISIR, Sorbonne Université |
Hélénon, François | Sorbonne Université |
Kappel, Mathilde | Institut Des Systèmes Intelligents Et De Robotique |
Páez Ubieta, Ignacio de Loyola | University of Alicante |
Gil, Pablo | University of Alicante |
Puente, Santiago | University of Alicante |
Ben Amar, Faiz | Université Pierre Et Marie Curie, Paris 6 |
Doncieux, Stéphane | Sorbonne University |
Keywords: Grasping, Data Sets for Robot Learning, Evolutionary Robotics
Abstract: Recent advances in AI have led to significant results in robotic learning, but skills like grasping remain partially solved. Many recent works exploit synthetic grasping datasets to learn to grasp unknown objects. However, those datasets were generated using simple grasp sampling methods using priors. Recently, Quality-Diversity (QD) algorithms have been proven to make grasp sampling significantly more efficient. In this work, we extend QDG-6DoF, a QD framework for generating object-centric grasps, to scale up the production of synthetic grasping datasets. We propose a data augmentation method that combines the transformation of object meshes with transfer learning from previous grasping repertoires. The conducted experiments show that this approach reduces the number of required evaluations per discovered robust grasp by up to 20%. We used this approach to generate QDGset, a dataset of 6DoF grasp poses that contains about 3.5 and 4.5 times more grasps and objects, respectively, than the previous state-of-the-art. Our method allows anyone to easily generate data, eventually contributing to a large-scale collaborative dataset of synthetic grasps.
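One simple way to picture mesh-transform augmentation of object-centric grasps (purely illustrative, not the QDG-6DoF code): under a uniform scaling of the object mesh, grasp orientations are preserved and grasp translations scale with the object, giving cheap candidate grasps that can then be re-evaluated.

```python
import numpy as np

def scale_object_and_grasps(vertices, grasp_poses, scale):
    """Uniformly scale an object mesh and transfer its object-centric grasps.

    vertices:    (N, 3) mesh vertices in the object frame.
    grasp_poses: list of 4x4 homogeneous gripper poses in the object frame.
    Under uniform scaling the grasp orientation is unchanged and the grasp
    translation scales with the object surface.
    """
    scaled_vertices = np.asarray(vertices) * scale
    scaled_grasps = []
    for T in grasp_poses:
        T_new = np.asarray(T).copy()
        T_new[:3, 3] *= scale         # move the grasp along with the surface
        scaled_grasps.append(T_new)
    return scaled_vertices, scaled_grasps

# Toy usage: one grasp 10 cm above the object origin, object enlarged by 20%.
verts = np.random.default_rng(0).random((100, 3)) - 0.5
grasp = np.eye(4)
grasp[:3, 3] = [0.0, 0.0, 0.10]
_, new_grasps = scale_object_and_grasps(verts, [grasp], scale=1.2)
print(new_grasps[0][:3, 3])           # [0.   0.   0.12]
```

Transferred grasps like these would normally serve only as seeds that still get re-evaluated in simulation, which is one way the number of evaluations per discovered robust grasp can drop.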
|
|
11:40-11:45, Paper WeCT16.6 | |
Patch Tree: Exploiting the Gauss Map and Principal Component Analysis for Robotic Grasping |
|
Jia, Yan-Bin | Iowa State University |
Xue, Yuechuan | Amazon.com |
Tang, Ling | Iowa State University |
Keywords: Grasping, In-Hand Manipulation
Abstract: Grasp planning must consider an object's local geometry (at the finger contacts), for the range of applicable wrenches under friction, and its global geometry, for force closure and grasp quality. Most everyday objects have curved surfaces unamenable to a pure combinatorial approach but treatable with tools from differential geometry. Our idea is to "discretize" such a surface in a top-down fashion into elementary patches (e-patches), each consisting of points that would yield close enough wrenches. Preprocessing based on Gaussian curvature decomposes the surface into strictly convex, strictly concave, ruled, and saddle patches. The Gauss map guides the subdivision of any patch with a large variation in the contact force direction, with the aid of a Platonic solid. Principal component analysis (PCA) further subdivides any patch that has a large variation in torque. The final structure is called a patch tree, which stores e-patches at its leaves, and force or torque ranges at its internal nodes. Grasp synthesis and optimization operate on the patch tree with a stack to efficiently prune away non-promising finger placements. Simulations and experiments with a Shadow Hand have been conducted on everyday items. The patch tree exhibits different levels of surface granularity. It holds good promise for efficient planning of finger gaits to carry out grasping and tool manipulation.
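The PCA subdivision step can be pictured generically: sample torques over a candidate patch, look at the spread along the principal axes, and split the patch when the variation is too large. The following is a hedged sketch; the sampling, threshold, and tree bookkeeping are assumptions rather than the paper's algorithm.

```python
import numpy as np

def torque_spread(points, normals, center=np.zeros(3)):
    """Principal-component spread of contact torques tau = (p - c) x n over the
    sampled points of a surface patch."""
    torques = np.cross(points - center, normals)
    torques = torques - torques.mean(axis=0)
    cov = torques.T @ torques / max(len(torques) - 1, 1)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending variances

def needs_subdivision(points, normals, tol=1e-3):
    """Split the patch when torque variation along the leading axis is large."""
    return bool(torque_spread(points, normals)[0] > tol)

# Toy usage: a small near-flat patch versus the same patch spread much wider.
rng = np.random.default_rng(2)
small_pts = np.column_stack([rng.uniform(-0.01, 0.01, 50),
                             rng.uniform(-0.01, 0.01, 50),
                             np.ones(50)])
flat_nrm = np.tile([0.0, 0.0, 1.0], (50, 1))
wide_pts = small_pts * np.array([30.0, 30.0, 1.0])
print(needs_subdivision(small_pts, flat_nrm))   # False
print(needs_subdivision(wide_pts, flat_nrm))    # True
```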
|
|
WeCT17 |
405 |
Localization 4 |
Regular Session |
Chair: Napp, Nils | Cornell University |
|
11:15-11:20, Paper WeCT17.1 | |
Improved Bag-Of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization |
|
Wilhelm, Aaron | Cornell University |
Napp, Nils | Cornell University |
Keywords: Localization, SLAM, Mapping
Abstract: Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate k-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method's effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
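For orientation, a minimal TF-IDF bag-of-words retrieval scorer is sketched below (generic, not the paper's AKM vocabulary with soft assignment): descriptors are assumed to be already quantized to visual-word ids, database images become normalized TF-IDF vectors, and a query is ranked by cosine similarity. The consistent-orientation and constant-scale constraints discussed above would enter at the descriptor and geometric-verification stages, which this sketch omits.

```python
import numpy as np

def build_tfidf(word_ids_per_image, vocab_size):
    """word_ids_per_image: one array of visual-word ids per database image."""
    n_images = len(word_ids_per_image)
    counts = np.zeros((n_images, vocab_size))
    for i, words in enumerate(word_ids_per_image):
        np.add.at(counts[i], words, 1.0)
    df = (counts > 0).sum(axis=0)                          # document frequency
    idf = np.log(n_images / np.maximum(df, 1.0))
    tfidf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0) * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    return tfidf / np.maximum(norms, 1e-12), idf

def rank_database(tfidf_db, idf, query_words, vocab_size):
    """Cosine-similarity scores of the query against every database image."""
    q = np.zeros(vocab_size)
    np.add.at(q, query_words, 1.0)
    q = q / max(q.sum(), 1.0) * idf
    q = q / max(np.linalg.norm(q), 1e-12)
    return tfidf_db @ q

vocab_size = 8
db = [np.array([0, 0, 1, 3]), np.array([2, 2, 4]), np.array([0, 1, 3, 3])]
tfidf_db, idf = build_tfidf(db, vocab_size)
scores = rank_database(tfidf_db, idf, np.array([0, 3, 3]), vocab_size)
print(np.argmax(scores))                                   # 2: the best match
```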
|
|
11:20-11:25, Paper WeCT17.2 | |
Improving Indoor Localization Accuracy by Using an Efficient Implicit Neural Map Representation |
|
Kuang, Haofei | University of Bonn |
Pan, Yue | University of Bonn |
Zhong, Xingguang | University of Bonn |
Wiesmann, Louis | University of Bonn |
Behley, Jens | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Keywords: Localization, Mapping, Deep Learning Methods
Abstract: Globally localizing a mobile robot in a known map is often a foundation for enabling robots to navigate and operate autonomously. In indoor environments, traditional Monte Carlo localization based on occupancy grid maps is considered the gold standard, but its accuracy is limited by the representation capabilities of the occupancy grid map. In this paper, we address the problem of building an effective map representation that allows us to accurately perform probabilistic global localization. To this end, we propose an implicit neural map representation that is able to capture positional and directional geometric features from 2D LiDAR scans to efficiently represent the environment, and we learn a neural network that is able to predict both the non-projective signed distance and a direction-aware projective distance for an arbitrary point in the mapped environment. This combination of a neural map representation with a lightweight neural network allows us to design an efficient observation model within a conventional Monte Carlo localization framework for pose estimation of a robot in real time. We evaluated our approach to indoor localization on a publicly available dataset for global localization, and the experimental results indicate that our approach is able to more accurately localize a mobile robot than other localization approaches employing occupancy or existing neural map representations. In contrast to other approaches employing an implicit neural map representation for 2D LiDAR localization, our approach allows us to perform real-time pose tracking after convergence and near real-time global localization. The code of our approach is available at: https://github.com/PRBonn/enm-mcl.
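To make the observation-model idea concrete, here is a generic Monte Carlo localization weight update that scores each particle by how close the endpoints of its transformed scan fall to the zero level set of a distance field. An analytic circular map stands in for the learned network, and all poses, noise values, and map parameters are illustrative assumptions.

```python
import numpy as np

def map_sdf(points, center=np.array([1.0, 0.0]), radius=2.0):
    """Signed distance to a circular wall, standing in for the learned map."""
    return np.linalg.norm(points - center, axis=1) - radius

def particle_weights(particles, scan_xy, sigma=0.1):
    """particles: (N, 3) poses [x, y, yaw]; scan_xy: (M, 2) scan endpoints in the
    sensor frame. A particle is likely when the endpoints of its transformed
    scan fall on the zero level set of the map's distance field."""
    weights = np.empty(len(particles))
    for i, (x, y, yaw) in enumerate(particles):
        R = np.array([[np.cos(yaw), -np.sin(yaw)],
                      [np.sin(yaw),  np.cos(yaw)]])
        world_pts = scan_xy @ R.T + np.array([x, y])
        d = map_sdf(world_pts)
        weights[i] = np.exp(-0.5 * np.sum((d / sigma) ** 2))
    return weights / weights.sum()

# Simulate a scan taken from the true pose (0, 0, 0) inside the circular room.
angles = np.linspace(-np.pi, np.pi, 60, endpoint=False)
ranges = np.cos(angles) + np.sqrt(np.cos(angles) ** 2 + 3.0)   # ray hits on the wall
scan = np.column_stack([ranges * np.cos(angles), ranges * np.sin(angles)])
particles = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0], [0.0, 0.0, 0.3]])
print(particle_weights(particles, scan))    # the weight concentrates on the true pose
```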
|
|
11:25-11:30, Paper WeCT17.3 | |
Semantic and Feature Guided Uncertainty Quantification of Visual Localization for Autonomous Vehicles |
|
Wu, Qiyuan | Cornell University |
Campbell, Mark | Cornell University |
Keywords: Localization, Sensor Fusion, Deep Learning for Visual Perception
Abstract: The uncertainty quantification of sensor measurements coupled with deep learning networks is crucial for many robotics systems, especially for safety-critical applications such as self-driving cars. This paper develops an uncertainty quantification approach in the context of visual localization for autonomous driving, where locations are selected based on images. Key to our approach is learning the measurement uncertainty with a light-weight sensor error model, which maps both image features and semantic information to a 2-dimensional error distribution. Our approach enables uncertainty estimation conditioned on the specific context of the matched image pair, implicitly capturing other critical, unannotated factors (e.g., city vs. highway, dynamic vs. static scenes, winter vs. summer) in a latent manner. We demonstrate the accuracy of our uncertainty prediction framework using the Ithaca365 dataset, which includes variations in lighting and weather (sunny, night, snowy). We evaluate both the uncertainty quantification of the sensor+network and Bayesian localization filters that use a unique sensor gating method. Results show that the measurement error does not follow a Gaussian distribution under poor weather and lighting conditions and is better predicted by our Gaussian Mixture model.
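The closing claim, that a Gaussian mixture predicts the error better than a single Gaussian, is typically checked with a negative log-likelihood computation of the observed error under the predicted distribution; the mixture parameters below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_nll(error_xy, weights, means, covs):
    """Negative log-likelihood of a 2D localization error under a predicted
    Gaussian mixture (weights sum to 1; means are (K, 2); covs are (K, 2, 2))."""
    likelihood = sum(w * multivariate_normal.pdf(error_xy, mean=m, cov=c)
                     for w, m, c in zip(weights, means, covs))
    return -np.log(max(likelihood, 1e-12))

# Example: a two-component mixture capturing heavy-tailed error in bad weather.
weights = [0.8, 0.2]
means = [np.zeros(2), np.zeros(2)]
covs = [0.1 * np.eye(2), 2.0 * np.eye(2)]
print(gmm_nll(np.array([0.5, -0.3]), weights, means, covs))
```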
|
|
11:30-11:35, Paper WeCT17.4 | |
LiLoc: Lifelong Localization Using Adaptive Submap Joining and Egocentric Factor Graph |
|
Fang, Yixin | Southeast University |
Li, Yanyan | Technical University of Munich |
Qian, Kun | Southeast University |
Tombari, Federico | Technische Universität München |
Wang, Yue | Zhejiang University |
Lee, Gim Hee | National University of Singapore |
Keywords: Localization, Mapping, SLAM
Abstract: This paper proposes a versatile graph-based lifelong localization framework, LiLoc, which enhances its timeliness by maintaining a single central session while improving accuracy through multi-modal factors between the central and subsidiary sessions. First, an adaptive submap joining strategy is employed to generate prior submaps (keyframes and poses) for the central session and to provide priors for subsidiaries when constraints are needed for robust localization. Next, a coarse-to-fine pose initialization for subsidiary sessions is performed using vertical recognition and ICP refinement in the global coordinate frame. To elevate the accuracy of subsequent localization, we propose an egocentric factor graph (EFG) module that integrates IMU preintegration, LiDAR odometry and scan match factors in a joint optimization manner. Specifically, the scan match factors are constructed by a novel propagation model that efficiently distributes the prior constraints as edges to the relevant prior pose nodes, weighted by noise based on keyframe registration errors. Additionally, the framework supports flexible switching between two modes, relocalization (RLM) and incremental localization (ILM), based on the proposed overlap-based mechanism to select or update the prior submaps from the central session. The proposed LiLoc is tested on public and custom datasets, demonstrating accurate localization performance against state-of-the-art methods. Our codes will be publicly available on https://github.com/Yixin-F/LiLoc.
|
|
11:35-11:40, Paper WeCT17.5 | |
ReFeree: Radar-Based Lightweight and Robust Localization Using Feature and Free Space |
|
Kim, Hogyun | Inha University |
Choi, Byunghee | Inha University |
Choi, Euncheol | Inha University |
Cho, Younggun | Inha University |
Keywords: Localization, SLAM, Field Robots
Abstract: Place recognition plays an important role in achieving robust long-term autonomy. Real-world robots face a wide range of weather conditions (e.g., overcast, heavy rain, and snow), and most sensors (e.g., cameras and LiDAR), which essentially operate within or near the visible electromagnetic spectrum, are sensitive to adverse weather, making reliable localization difficult. In contrast, radar is gaining traction because its longer electromagnetic wavelengths are less affected by environmental changes and are largely weather-independent. In this work, we propose a radar-based lightweight and robust place recognition method. We achieve rotational invariance and a lightweight footprint by using a one-dimensional ring-shaped descriptor, and robustness by mitigating the impact of false detections through the opposite noise characteristics of free space and features. In addition, the initial heading can be estimated, which can assist in building a SLAM pipeline that combines odometry and registration while respecting onboard computing constraints. The proposed method was rigorously validated across various scenarios (i.e., single-session, multi-session, and different weather conditions). In particular, we show that our descriptor achieves reliable place recognition performance in extreme environments that lack structural information, such as the OORD dataset.
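A toy sketch of how a one-dimensional, rotation-invariant ring descriptor can be formed from a polar radar scan by aggregating over azimuth per range ring; the threshold and the exact feature/free-space statistic are placeholders, not the authors' formulation:

```python
import numpy as np

def ring_descriptor(polar_scan, power_threshold=0.2):
    """Build a 1D ring-shaped descriptor from a polar radar scan.

    polar_scan: (A, R) array, A azimuth bins x R range bins of return power.
    Aggregating over azimuth makes the descriptor invariant to yaw rotation.
    A free-space count per ring (returns below threshold) is kept alongside
    the feature count, loosely following the feature/free-space idea.
    """
    features = (polar_scan >= power_threshold).sum(axis=0)   # per range ring
    free_space = (polar_scan < power_threshold).sum(axis=0)
    return features / (features + free_space)                # occupancy ratio per ring

def descriptor_distance(d1, d2):
    """Simple L1 distance used for place retrieval."""
    return np.abs(d1 - d2).sum()
```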
|
|
11:40-11:45, Paper WeCT17.6 | |
On the Consistency of Multi-Robot Cooperative Localization: A Transformation-Based Approach |
|
Hao, Ning | Harbin Institute of Technology |
He, Fenghua | Harbin Institute of Technology |
Tian, Chungeng | Harbin Institute of Technology |
Hou, Yi | Harbin Institute of Technology |
Keywords: Localization, SLAM, Multi-Robot Systems
Abstract: This paper investigates the inconsistency problem caused by the mismatch of observability properties commonly found in multi-robot cooperative localization (CL) and simultaneous localization and mapping (SLAM). To address this issue, we propose a transformation-based approach that introduces a linear time-varying transformation to ensure the transformed system possesses a state-independent unobservable subspace. Consequently, its observability properties remain unaffected by the linearization points. We establish the relationship between the unobservable subspaces of the original and transformed systems, guiding the design of the time-varying transformation. We then present a novel estimator based on this method, referred to as the Transformed EKF (T-EKF), which utilizes the transformed system for state estimation, thereby ensuring correct observability and thus consistency. The proposed approach has been extensively validated through both Monte Carlo simulations and real-world experiments, demonstrating better performance in terms of both accuracy and consistency compared to state-of-the-art methods.
|
|
WeCT18 |
406 |
Software Tools 2 |
Regular Session |
Chair: Wauters, Jolan | Ghent University |
|
11:15-11:20, Paper WeCT18.1 | |
Chemistry3D: Robotic Interaction Toolkit for Chemistry Experiments |
|
Li, Shoujie | Tsinghua Shenzhen International Graduate School |
Huang, Yan | Tsinghua University |
Guo, Changqing | Tsinghua University |
Wu, Tong | Tsinghua University |
Zhang, Jiawei | Tsinghua University |
Zhang, Linrui | Tsinghua University |
Ding, Wenbo | Tsinghua University |
Keywords: Software Tools for Benchmarking and Reproducibility, Software Architecture for Robotic and Automation, Methods and Tools for Robot System Design
Abstract: The advent of simulation engines has revolutionized learning and operational efficiency for robots, offering cost-effective and swift pipelines. However, the lack of a universal simulation platform tailored for chemical scenarios impedes progress in robotic manipulation and visualization of reaction processes. Addressing this void, we present Chemistry3D, an innovative toolkit that integrates extensive chemical and robotic knowledge. Chemistry3D not only enables robots to perform chemical experiments but also provides real-time visualization of temperature, color, and pH changes during reactions. Built on the NVIDIA Omniverse platform, Chemistry3D offers interfaces for robot operation, visual inspection, and liquid flow control, facilitating the simulation of special objects such as liquids and transparent entities. Leveraging this toolkit, we have devised RL tasks, object detection, and robot operation scenarios. Additionally, to discern disparities between the rendering engine and the real world, we conducted transparent object detection experiments using Sim2Real, validating the toolkit's exceptional simulation performance. The source code is available at https://github.com/huangyan28/Chemistry3D, and a related tutorial can be found at https://www.omni-chemistry.com.
|
|
11:20-11:25, Paper WeCT18.2 | |
Introducing KUGE: A Simultaneous Control Co-Design Architecture and Its Application to Aerial Robotics Development |
|
Wauters, Jolan | Ghent University |
Lefebvre, Tom | Ghent University |
Crevecoeur, Guillaume | Ghent University |
Keywords: Methods and Tools for Robot System Design, Optimization and Optimal Control, Aerial Systems: Applications
Abstract: The increasing complexity of tasks performed by hybrid aerial robotic systems, such as tail-sitters, demands a more integrated approach to their design. Traditional sequential design methods fall short because they separate the control system design from the conceptual design, limiting the potential for discovering coupled solutions. This disjointed process constrains the design space, making it difficult to optimize both the control performance and system dynamics simultaneously. In response to this limitation, there has been growing interest in mission-specific dynamic design procedures, which aim to address specific operational challenges by integrating control and design early in the development process. The multi-disciplinary approach of control co-design (CCD) expands the design space by solving control and system design problems concurrently. The recently introduced DAIMYO framework demonstrated that combining multi-fidelity modelling with a nested CCD approach can tackle the sim-to-real gap. However, DAIMYO’s reliance on Bayesian optimization to account for the computational cost increase of a nested formulation limits its scalability. To address these issues, we propose KUGE, a simultaneous CCD strategy that reduces computational complexity and overcomes dimensionality restrictions through a combined effort of stochastic optimization and Gaussian processes. We validate the effectiveness of KUGE by applying it to the dynamic design of a tail-sitter, showing that it is competitive with the DAIMYO architecture while offering greater computational efficiency.
|
|
11:25-11:30, Paper WeCT18.3 | |
HEROES: Unreal Engine-Based Human and Emergency Robot Operation Education System |
|
Chaudhary, Anav | Purdue University |
Tiwari, Kshitij | Purdue University |
Bera, Aniket | Purdue University |
Keywords: Simulation and Animation, Planning under Uncertainty, Task and Motion Planning
Abstract: Training and preparing first responders and humanitarian robots for Mass Casualty Incidents (MCIs) often poses a challenge owing to the lack of realistic and easily accessible test facilities. While such facilities can offer realistic post-MCI scenarios for the training and education of first responders and humanitarian robots, they are often hard to access owing to logistical constraints. To overcome this challenge, we present HEROES, a versatile Unreal Engine-based simulator for designing novel training simulations for humans and emergency robots in such urban search and rescue operations. The proposed HEROES simulator is capable of generating synthetic datasets for machine learning pipelines that are used for training robot navigation. This work addresses the necessity for a comprehensive training platform in the robotics community, ensuring pragmatic and efficient preparation for real-world emergency scenarios. The strengths of our simulator lie in its adaptability, scalability, and ability to facilitate collaboration between robot developers and first responders, fostering synergy in developing effective strategies for search and rescue operations in MCIs. We conducted a preliminary user study in which HEROES received an average score of 8.1 out of 10 for its ability to generate sufficiently varied environments and 7.8 out of 10 for the usefulness of the simulation environment. HEROES has been integrated with ROS and has been used to train an RL model for a real robot as a proof of concept.
|
|
11:30-11:35, Paper WeCT18.4 | |
On the Necessity of Real-Time Principles in GPU-Driven Autonomous Robots |
|
Ali, Syed | University of North Carolina at Chapel Hill |
Angelopoulos, Angelos | University of North Carolina at Chapel Hill |
Massey, Denver | University of North Carolina at Chapel Hill |
Haddix, Sarah Barnes | The University of North Carolina at Chapel Hill |
Georgiev, Alexander | University of North Carolina at Chapel Hill |
Goh, Joseph | University of North Carolina at Chapel Hill |
Wagle, Rohan | University of North Carolina at Chapel Hill |
Sarathy, Prakash | Northrop Grumman |
Anderson, James | University of North Carolina at Chapel Hill |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Keywords: Software Architecture for Robotic and Automation, Software, Middleware and Programming Environments, Robot Safety
Abstract: Robot autonomy is driving an ever-increasing demand for computational power, including on-board multi-core CPUs and accelerators such as GPUs, to enable fast perception, planning, control, and more. Careful scheduling of these computational tasks on the CPU cores and GPUs is important to prevent locking up the finite computational capacity in ways that hinder other critical workloads; delays in computing time-critical tasks like obstacle detection and control can have huge negative consequences for autonomous robots, potentially resulting in damage, substantial financial loss, or even loss of life. In this paper, we leverage recent advances from real-time systems research. We apply TimeWall, a component-based real-time framework, to the computational components of an autonomous drone and experimentally show that the timeliness and safe operation properties of a drone are preserved even in the presence of increasing interfering computational processes.
|
|
11:35-11:40, Paper WeCT18.5 | |
HPRM: High-Performance Robotic Middleware for Intelligent Autonomous Systems |
|
Kwok, Jacky | University of California, Berkeley |
Li, Shulu | UC Berkeley, Fudan University |
Lohstroh, Marten | UC Berkeley |
Lee, Edward A. | UC Berkeley |
Keywords: Software Architecture for Robotic and Automation, Computer Architecture for Robotic and Automation, Software, Middleware and Programming Environments
Abstract: The rise of intelligent autonomous systems, especially in robotics and autonomous agents, has created a critical need for robust communication middleware that can ensure real-time processing of extensive sensor data. Current robotics middleware like Robot Operating System (ROS) 2 faces challenges with nondeterminism and high communication latency when dealing with large data across multiple subscribers on a multi-core compute platform. To address these issues, we present High-Performance Robotic Middleware (HPRM), built on top of the deterministic coordination language Lingua Franca (LF). HPRM employs optimizations including an in-memory object store for efficient zero-copy transfer of large payloads, adaptive serialization to minimize serialization overhead, and an eager protocol with real-time sockets to reduce handshake latency. Benchmarks show HPRM achieves up to 114x lower latency than ROS2 when broadcasting large messages to multiple nodes. We then demonstrate the benefits of HPRM by integrating it with the CARLA simulator and running reinforcement learning agents along with object detection workloads. In the CARLA autonomous driving application, HPRM attains 91.1% lower latency than ROS2. The deterministic coordination semantics of HPRM, combined with its optimized IPC mechanisms, enable efficient and predictable real-time communication for intelligent autonomous systems. Code and videos can be found on our project page: https://hprm-robotics.github.io/HPRM
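HPRM itself is built on Lingua Franca, but the zero-copy idea behind its in-memory object store can be illustrated with Python's standard shared-memory facility: the payload is written once, and a subscriber maps the same buffer instead of receiving a serialized copy. Both ends run in one process below purely so the example is self-contained; a real middleware would add lifetime management and notification.

```python
import numpy as np
from multiprocessing import shared_memory

if __name__ == "__main__":
    image = np.random.rand(1080, 1920, 3).astype(np.float32)

    # "Publish": one copy of the payload into a named shared-memory block.
    shm = shared_memory.SharedMemory(create=True, size=image.nbytes)
    src = np.ndarray(image.shape, dtype=image.dtype, buffer=shm.buf)
    src[:] = image

    # "Subscribe": another process would attach by name and map the same
    # buffer directly, with no serialization or per-subscriber copy.
    reader = shared_memory.SharedMemory(name=shm.name)
    view = np.ndarray(image.shape, dtype=image.dtype, buffer=reader.buf)
    print(np.allclose(view, image))

    reader.close()
    shm.close()
    shm.unlink()
```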
|
|
11:40-11:45, Paper WeCT18.6 | |
CusADi: A GPU Parallelization Framework for Symbolic Expressions and Optimal Control |
|
Jeon, Se Hwan | Massachusetts Institute of Technology |
Hong, Seungwoo | MIT (Massachusetts Institute of Technology) |
Lee, Ho Jae | Massachusetts Institute of Technology |
Khazoom, Charles | Massachusetts Institute of Technology |
Kim, Sangbae | Massachusetts Institute of Technology |
Keywords: Software Tools for Robot Programming, Optimization and Optimal Control, Reinforcement Learning
Abstract: The parallelism afforded by GPUs presents significant advantages in training controllers through reinforcement learning (RL). However, integrating model-based optimization into this process remains challenging due to the complexity of formulating and solving optimization problems across thousands of instances. In this work, we present CusADi, an extension of the CasADi symbolic framework to support the parallelization of arbitrary closed-form expressions on GPUs with CUDA. We also formulate a closed-form approximation for solving general optimal control problems, enabling large-scale parallelization and evaluation of MPC controllers. Our results show a ten-fold speedup relative to a similar MPC implementation on the CPU, and we demonstrate the use of CusADi for various applications, including parallel simulation, parameter sweeps, and policy training.
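For readers unfamiliar with the CasADi side of this, the snippet below defines a closed-form symbolic expression and evaluates it over many instances with CasADi's built-in map construct; CusADi's CUDA code generation is not reproduced here, and the toy dynamics are a placeholder:

```python
import casadi as ca
import numpy as np

# Symbolic single-instance expression: one Euler step of a torque-driven pendulum.
x = ca.SX.sym("x", 2)          # state: [angle, angular velocity]
u = ca.SX.sym("u")             # control torque
dt = 0.01
x_next = ca.vertcat(x[0] + dt * x[1],
                    x[1] + dt * (u - 9.81 * ca.sin(x[0])))
step = ca.Function("step", [x, u], [x_next])

# Evaluate the same closed-form expression over many instances at once.
N = 4096
batched = step.map(N)
X = np.random.randn(2, N)
U = np.random.randn(1, N)
X_next = batched(X, U)
print(X_next.shape)            # (2, 4096)
```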
|
|
WeCT19 |
407 |
System Design |
Regular Session |
|
11:15-11:20, Paper WeCT19.1 | |
Learning Optimal Design Manifolds to Design More Practical Robotic Systems |
|
Baumgärtner, Jan | Karlsruhe Institute of Technology |
Puchta, Alexander | Karlsruhe Institute of Technology |
Fleischer, Jürgen | Karlsruhe Institute of Technology (KIT) |
Keywords: Methods and Tools for Robot System Design, Optimization and Optimal Control, Representation Learning
Abstract: This paper introduces the optimal design manifold as a novel approach for understanding and optimizing the design of robotic systems. Existing optimization frameworks often jointly optimize design and behavior but lack insight into why specific designs are optimal for given tasks. Additionally, a functionally optimal design may not always be the most practical to build, and practicality cannot always be captured by an objective function. By defining and learning the optimal design manifold, which represents the space of all optimal solutions, we provide a systematic method for exploring the design space and selecting the most practical optimal design. We apply the optimal design manifold to robot cell layout optimization, robot design optimization, and multi-camera placement and demonstrate its effectiveness in enhancing design choices by enabling a deeper understanding of what makes a design optimal.
|
|
11:20-11:25, Paper WeCT19.2 | |
Monotone Subsystem Decomposition for Efficient Multi-Objective Robot Design |
|
Wilhelm, Andrew | Cornell University |
Napp, Nils | Cornell University |
Keywords: Methods and Tools for Robot System Design, Optimization and Optimal Control, Formal Methods in Robotics and Automation
Abstract: Automating design minimizes errors, accelerates the design process, and reduces cost. However, automating robot design is challenging due to recursive constraints, multiple design objectives, and cross-domain design complexity possibly spanning multiple abstraction layers. Here we look at the problem of component selection, a combinatorial optimization problem in which a designer, given a robot model, must select compatible components from an extensive catalog. The goal is to satisfy high-level task specifications while optimally balancing trade-offs between competing design objectives. In this paper, we extend our previous constraint programming approach to multi-objective design problems and propose the novel technique of monotone subsystem decomposition to efficiently compute a Pareto front of solutions for large-scale problems. We prove that subsystems can be optimized for their Pareto fronts and, under certain conditions, these results can be used to determine a globally optimal Pareto front. Furthermore, subsystems serve as an intuitive design abstraction and can be reused across various design problems. Using an example quadcopter design problem, we compare our method to a linear programming approach and demonstrate our method scales better for large catalogs, solving a multi-objective problem of 10^25 component combinations in seconds. We then expand the original problem and solve a task-oriented, multi-objective design problem to build a fleet of quadcopters to deliver packages. We compute a Pareto front of solutions in seconds where each solution contains an optimal component-level design and an optimal package delivery schedule for each quadcopter.
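The multi-objective component-selection setting can be illustrated with a plain Pareto-dominance filter over a tiny, hypothetical catalog; the paper's subsystem decomposition and constraint programming are what make this tractable at 10^25 combinations, which this sketch does not attempt:

```python
import numpy as np
from itertools import product

def pareto_front(costs):
    """Return indices of non-dominated points (all objectives minimized)."""
    costs = np.asarray(costs, dtype=float)
    keep = []
    for i, c in enumerate(costs):
        dominated = np.any(np.all(costs <= c, axis=1) & np.any(costs < c, axis=1))
        if not dominated:
            keep.append(i)
    return keep

# Hypothetical quadcopter catalog: (mass [kg], cost [$]) per motor and battery.
motors = [(0.10, 25.0), (0.15, 18.0), (0.08, 40.0)]
batteries = [(0.30, 30.0), (0.45, 22.0)]
designs = [(m[0] + b[0], m[1] + b[1]) for m, b in product(motors, batteries)]
print([designs[i] for i in pareto_front(designs)])
```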
|
|
11:25-11:30, Paper WeCT19.3 | |
Robust Reinforcement Learning-Based Locomotion for Resource-Constrained Quadrupeds with Exteroceptive Sensing |
|
Plozza, Davide | ETH Zürich |
Apostol, Patricia | ETH Zürich |
Joseph, Paul | ETH Zürich |
Schläpfer, Simon | ETH Zurich |
Magno, Michele | ETH Zurich |
Keywords: Engineering for Robotic Systems, Legged Robots, Reinforcement Learning
Abstract: Compact quadrupedal robots are proving increasingly suitable for deployment in real-world scenarios. Their smaller size fosters easy integration into human environments. Nevertheless, real-time locomotion on uneven terrains remains challenging, particularly due to the high computational demands of terrain perception. This paper presents a robust reinforcement learning-based exteroceptive locomotion controller for resource-constrained small-scale quadrupeds in challenging terrains, which exploits real-time elevation mapping, supported by a careful depth sensor selection. We concurrently train both a policy and a state estimator, which together provide an odometry source for elevation mapping, optionally fused with visual-inertial odometry (VIO). We demonstrate the importance of positioning an additional time-of-flight sensor for maintaining robustness even without VIO, thus having the potential to free up computational resources. We experimentally demonstrate that the proposed controller can flawlessly traverse steps up to 17.5 cm in height and achieve an 80% success rate on 22.5 cm steps, both with and without VIO. The proposed controller also achieves accurate forward and yaw velocity tracking of up to 1.0 m/s and 1.5 rad/s respectively. We open-source our training code at github.com/ETH-PBL/elmap-rl-controller.
|
|
11:30-11:35, Paper WeCT19.4 | |
AeroSafe: Mobile Indoor Air Purification Using Aerosol Residence Time Analysis and Robotic Cough Emulator Testbed |
|
Tonmoy, Tanjid | University of California San Diego |
Malladi, Rahath | Plaksha University |
Singh, Kaustubh | Plaksha University |
Forsad, Al Hossain | University of Massachusetts |
Gupta, Rajesh Kumar | Halicioglu Data Science Institute, UC San Diego |
Martinez, Andres Tejada | University of Florida |
Rahman, Tauhidur | University of California San Diego |
Keywords: Software-Hardware Integration for Robot Systems, Deep Learning Methods, Sensor-based Control
Abstract: Indoor air quality plays an essential role in the safety and well-being of occupants, especially in the context of airborne diseases. This paper introduces AeroSafe, a novel approach aimed at enhancing the efficacy of indoor air purification systems through a robotic cough emulator testbed and a digital-twins-based aerosol residence time analysis. Current portable air filters often overlook the concentrations of respiratory aerosols generated by coughs, posing a risk, particularly in high-exposure environments like healthcare facilities and public spaces. To address this gap, we present a robotic dual-agent physical emulator comprising a manoeuvrable mannequin simulating cough events and a portable air purifier autonomously responding to aerosols. The generated data from this emulator trains a digital twins model, combining a physics-based compartment model with a machine learning approach, using Long Short-Term Memory (LSTM) networks and graph convolution layers. Experimental results demonstrate the model's ability to predict aerosol concentration dynamics with a mean residence time prediction error within 35 seconds. The proposed system's real-time intervention strategies outperform static air filter placement, showcasing its potential in mitigating airborne pathogen risks.
|
|
11:35-11:40, Paper WeCT19.5 | |
Remote Inspection Techniques: A Review of Autonomous Robotic Inspection for Marine Vessels (I) |
|
Andersen, Rasmus Eckholdt | Technical University of Denmark |
Brogaard, Rune Y. | Explicit Aps |
Boukas, Evangelos | Technical University of Denmark |
Keywords: Field Robots, Aerial Systems: Applications, Deep Learning Methods
Abstract: Due to the harsh environment and heavy use that modern marine vessels are subjected to, they are required to undergo periodic inspections to determine their current condition. The use of autonomous remote inspection systems can alleviate some of the dangers and shortcomings associated with manual inspection. While there has been research on the use of robotic platforms, none of the works in the literature evaluates the current state of the art with respect to the specifications of the classification societies, who are the most important stakeholders among the end users. The aim of this paper is to provide an overview of the existing literature and evaluate the works individually in collaboration with classification societies. The papers included in this review are either directly developed for, or have properties potentially transferable to, the marine vessel inspection process. To structure the review, an expertise-engineering separation is proposed based on the contributions of the individual paper. This separation shows which part of the inspection process has received the most attention, as well as where the shortcomings of each approach lie. The findings indicate that while there are promising approaches, there is still a gap between the classification societies’ requirements and the state of the art. Our results indicate that there is quality work in the literature, but there is a lack of integrated development activities that achieve sufficient completeness.
|
|
11:40-11:45, Paper WeCT19.6 | |
Toward Fully Automated Aviation: PIBOT, a Humanoid Robot Pilot, for Human-Centric Aircraft Cockpits |
|
Min, Sungjae | Korea Advanced Institute of Science and Technology (KAIST) |
Kang, Gyuree | Korea Advanced Institute of Science and Technology (KAIST) |
Kim, Hyungjoo | Korea Advanced Institute of Science and Technology (KAIST) |
Shim, David Hyunchul | KAIST |
Keywords: Humanoid Robot Systems, AI-Enabled Robotics, Engineering for Robotic Systems
Abstract: Humanoid robots have been considered ideal for automating daily tasks, though most research has centered on bipedal locomotion. Many activities we do routinely, such as driving a car, require real-time system manipulation as well as substantial field-specific knowledge. Recent breakthroughs in natural language processing, particularly with large language models (LLMs), are empowering humanoid robots to access and process vast information sources and operate systems with an unprecedented level of autonomy. This article introduces PIBOT, a humanoid robot that can pilot unmodified general aviation (GA) aircraft, physically manipulating instruments while following strict rules of the air and verbally communicating with copilots and air traffic controllers (ATCs). Building on these capabilities, we developed an LLM-based task planner that interprets natural language commands, translating them into action sequences. Then, the behavior decision module breaks tasks into precise limb movements, enabling humanlike control of cockpit instruments. In a series of rigorous simulations, PIBOT demonstrates its capabilities to successfully take off and land an airplane from a cold-and-dark start, showcasing its potential for a fully autonomous robot pilot.
|
|
WeCT20 |
408 |
Human-Aware Robot Motion |
Regular Session |
Chair: Murphey, Todd | Northwestern University |
Co-Chair: Carlone, Luca | Massachusetts Institute of Technology |
|
11:15-11:20, Paper WeCT20.1 | |
Sampling-Based Grasp and Collision Prediction for Assisted Teleoperation |
|
Manschitz, Simon | Honda Research Institute Europe |
Güler, Berk | TU Darmstadt |
Ma, Wei | Honda Research Institute Europe |
Ruiken, Dirk | Honda Research Institute Europe |
Keywords: Telerobotics and Teleoperation
Abstract: Shared autonomy allows for combining the global planning capabilities of a human operator with the strengths of a robot such as repeatability and accurate control. In a real-time teleoperation setting, one possibility for shared autonomy is to let the human operator decide on the rough movement and to let the robot make fine adjustments, e.g., when the view of the operator is occluded. We present a learning-based concept for shared autonomy that aims at supporting the human operator in a real-time teleoperation setting. At every step, our system tracks the target pose set by the human operator as accurately as possible while at the same time satisfying a set of constraints which influence the robot’s behavior. An important characteristic is that the constraints can be dynamically activated and deactivated, which allows the system to provide task-specific assistance. Since the system must generate robot commands in real-time, solving an optimization problem in every iteration is not feasible. Instead, we sample potential target configurations and use neural networks to predict the constraint costs for each configuration. By evaluating each configuration in parallel, our system is able to select the target configuration which satisfies the constraints and has the minimum distance to the operator’s target pose with minimal delay. We evaluate the framework with a pick-and-place task on a bi-manual setup with two Franka Emika Panda robot arms with Robotiq grippers.
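A schematic of the sample-score-select loop described above, with the learned constraint-cost networks replaced by stand-in callables and a toy planar arm; none of this is the authors' code:

```python
import numpy as np

def select_target(q_samples, operator_pose, fk, constraint_predictors, thresholds):
    """Pick the sampled configuration that satisfies all active constraints and
    is closest (in end-effector space) to the operator's commanded pose.

    fk(q) -> end-effector position; constraint_predictors are stand-ins for
    learned cost networks (one per constraint), evaluated in parallel in the paper.
    """
    best_q, best_dist = None, np.inf
    for q in q_samples:
        costs = [pred(q) for pred in constraint_predictors]
        if any(c > t for c, t in zip(costs, thresholds)):
            continue                                  # violates an active constraint
        dist = np.linalg.norm(fk(q) - operator_pose)
        if dist < best_dist:
            best_q, best_dist = q, dist
    return best_q

# Toy usage with a planar 2-DoF arm and a single "stay above the table" constraint.
fk = lambda q: np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                         np.sin(q[0]) + np.sin(q[0] + q[1])])
collision_cost = lambda q: max(0.0, -fk(q)[1])        # penalize going below y = 0
samples = [np.random.uniform(-np.pi, np.pi, 2) for _ in range(256)]
print(select_target(samples, np.array([1.2, 0.8]), fk, [collision_cost], [0.0]))
```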
|
|
11:20-11:25, Paper WeCT20.2 | |
Inverse Mixed Strategy Games with Generative Trajectory Models |
|
Sun, Max Muchen | Northwestern University |
Trautman, Peter | Honda Research Institute |
Murphey, Todd | Northwestern University |
Keywords: Human-Aware Motion Planning, Path Planning for Multiple Mobile Robots or Agents, Probabilistic Inference
Abstract: Game-theoretic models are effective tools for modeling multi-agent interactions, especially when robots need to coordinate with humans. However, applying these models requires inferring their specifications from observed behaviors---a challenging task known as the inverse game problem. Existing inverse game approaches often struggle to account for behavioral uncertainty and measurement noise, and to leverage both offline and online data. To address these limitations, we propose an inverse game method that integrates a generative trajectory model into a differentiable mixed-strategy game framework. By representing the mixed strategy with a conditional variational autoencoder (CVAE), our method can infer high-dimensional, multi-modal behavior distributions from noisy measurements while adapting in real-time to new observations. We extensively evaluate our method in a simulated navigation benchmark, where the observations are generated by an unknown game model. Despite the model mismatch, our method can infer Nash-optimal actions comparable to those of the ground-truth model and the oracle inverse game baseline, even in the presence of uncertain agent objectives and noisy measurements.
|
|
11:25-11:30, Paper WeCT20.3 | |
AToM: Adaptive Theory-Of-Mind-Based Human Motion Prediction in Long-Term Human-Robot Interactions |
|
Liao, Yuwen | Nanyang Technological University |
Cao, Muqing | Carnegie Mellon University |
Xu, Xinhang | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Social HRI, Intention Recognition, Long term Interaction
Abstract: Humans learn from observations and experiences to adjust their behaviours towards better performance. Interacting with such dynamic humans is challenging, as the robot needs to predict their behaviour accurately for safe and efficient operation. Long-term interactions with dynamic humans have not been extensively studied by prior works. We propose an adaptive human prediction model based on the Theory-of-Mind (ToM), a fundamental social-cognitive ability that enables humans to infer others’ behaviours and intentions. We formulate the human internal belief about others using a game-theoretic model, which predicts the future motions of all agents in a navigation scenario. To estimate an evolving belief, we use an Unscented Kalman Filter to update the behavioural parameters in the human internal model. Our formulation provides unique interpretability to dynamic human behaviours by inferring how the human predicts the robot. We demonstrate through long-term experiments in both simulations and real-world settings that our prediction effectively promotes safety and efficiency in downstream robot planning. Code will be available at https://github.com/centiLinda/AToM-human-prediction.git.
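As a generic illustration of estimating behavioural parameters with an Unscented Kalman Filter (here via the filterpy library, which the paper does not necessarily use), with a deliberately simplified goal-seeking human model and made-up parameters:

```python
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

# Hidden state: hypothetical behavioural parameters (preferred speed, goal gain).
# The paper's internal game-theoretic model is far richer than this stand-in.
pos, goal, dt = np.array([0.0, 0.0]), np.array([5.0, 0.0]), 0.1

def fx(theta, dt):
    return theta                      # parameters drift slowly; random-walk model

def hx(theta):
    speed, gain = theta
    direction = (goal - pos) / np.linalg.norm(goal - pos)
    return pos + dt * speed * gain * direction   # predicted next human position

points = MerweScaledSigmaPoints(n=2, alpha=0.1, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=2, dim_z=2, dt=dt, fx=fx, hx=hx, points=points)
ukf.x = np.array([1.0, 1.0])          # initial parameter guess
ukf.P *= 0.5
ukf.R *= 0.05                         # noise on the observed position
ukf.Q *= 1e-3                         # slow parameter drift

observed_next_pos = np.array([0.12, 0.01])
ukf.predict()
ukf.update(observed_next_pos)
print(ukf.x)                          # refined behavioural parameters
```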
|
|
11:30-11:35, Paper WeCT20.4 | |
Learning Dynamic Weight Adjustment for Spatial-Temporal Trajectory Planning in Crowd Navigation |
|
Cao, Muqing | Carnegie Mellon University |
Xu, Xinhang | Nanyang Technological University |
Yang, Yizhuo | Nanyang Technological University |
Li, Jianping | Nanyang Technological University |
Jin, Tongxing | Nanyang Technological University |
Wang, Pengfei | Nanyang Technological University |
Hung, Tzu-Yi | Delta Electronics |
Lin, Guosheng | Nanyang Technological University |
Xie, Lihua | Nanyang Technological University |
Keywords: Human-Aware Motion Planning, Motion and Path Planning, Reinforcement Learning
Abstract: Robot navigation in dense human crowds poses a significant challenge due to the complexity of human behavior in dynamic and obstacle-rich environments. In this work, we propose a dynamic weight adjustment scheme using a neural network to predict the optimal weights of objectives in an optimization-based motion planner. We adopt a spatial-temporal trajectory planner and incorporate diverse objectives to achieve a balance among safety, efficiency, and goal achievement in complex and dynamic environments. We design the network structure, observation encoding, and reward function to effectively train the policy network using reinforcement learning, allowing the robot to adapt its behavior in real time based on environmental and pedestrian information. Simulation results show improved safety compared to the fixed-weight planner and the state-of-the-art learning-based methods, and verify the ability of the learned policy to adaptively adjust the weights based on the observed situations. The feasibility of the approach is demonstrated in a navigation task using an autonomous delivery robot across a crowded corridor over a 300 m distance.
|
|
11:35-11:40, Paper WeCT20.5 | |
COLLAGE: COLLAborative Human-Agent Interaction Generation Using Hierarchical Latent Diffusion and Language Models |
|
Daiya, Divyanshu | Purdue University |
Conover, Damon | DEVCOM Army Research Laboratory |
Bera, Aniket | Purdue University |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Motion and Path Planning, Modeling and Simulating Humans
Abstract: We propose a novel framework, COLLAGE, for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D and InterHuman datasets demonstrate the effectiveness of our approach in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods. Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics, and computer vision.
|
|
11:40-11:45, Paper WeCT20.6 | |
Long-Term Human Trajectory Prediction Using 3D Dynamic Scene Graphs |
|
Gorlo, Nicolas | Massachusetts Institute of Technology |
Schmid, Lukas M. | Massachusetts Institute of Technology (MIT) |
Carlone, Luca | Massachusetts Institute of Technology |
Keywords: Human and Humanoid Motion Analysis and Synthesis, Datasets for Human Motion, Long term Interaction
Abstract: We present a novel approach for long-term human trajectory prediction in indoor human-centric environments, which is essential for long-horizon robot planning in these environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage Large Language Models (LLMs) to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood (NLL) and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged (i.e., evaluated in a zero-shot fashion on the dataset) baselines for a time horizon of 60s.
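The continuous-time Markov chain grounding step can be illustrated with a small, hypothetical generator matrix over interaction states; the distribution over states at the 60 s horizon follows from the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical interaction states: 0 = at desk, 1 = at kitchen, 2 = at sofa.
# Q is a CTMC generator: off-diagonal entries are transition rates (1/s),
# and each row sums to zero.
Q = np.array([[-0.020, 0.015, 0.005],
              [0.010, -0.030, 0.020],
              [0.005, 0.010, -0.015]])

def state_distribution(p0, Q, t):
    """Distribution over interaction states after t seconds: p(t) = p0 expm(Q t)."""
    return p0 @ expm(Q * t)

p0 = np.array([1.0, 0.0, 0.0])           # human currently at the desk
print(state_distribution(p0, Q, 60.0))   # distribution at the 60 s horizon
```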
|
|
WeCT21 |
410 |
Robot Foundation Models 2 |
Regular Session |
Co-Chair: Zhu, Yuke | The University of Texas at Austin |
|
11:15-11:20, Paper WeCT21.1 | |
LUMOS: Language-Conditioned Imitation Learning with World Models |
|
Nematollahi, Iman | University of Freiburg |
DeMoss, Branton | University of Oxford |
L Chandra, Akshay | University of Freiburg |
Hawes, Nick | University of Oxford |
Burgard, Wolfram | University of Technology Nuremberg |
Posner, Ingmar | Oxford University |
Keywords: Imitation Learning, Reinforcement Learning
Abstract: We introduce LUMOS, a language-conditioned multi-task imitation learning framework for robotics. LUMOS learns skills by practicing them over many long-horizon rollouts in the latent space of a learned world model and transfers these skills zero-shot to a real robot. By learning on-policy in the latent space of the learned world model, our algorithm mitigates the policy-induced distribution shift from which most offline imitation learning methods suffer. LUMOS learns from unstructured play data with fewer than 1% hindsight language annotations but is steerable with language commands at test time. We achieve this coherent long-horizon performance by combining latent planning with both image- and language-based hindsight goal relabeling during training, and by optimizing an intrinsic reward defined in the latent space of the world model over multiple time steps, effectively reducing covariate shift. In experiments on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior learning-based methods with comparable approaches on chained multi-task evaluations. To the best of our knowledge, we are the first to learn language-conditioned continuous visuomotor control for a real-world robot within an offline world model. Videos, dataset and code are available at http://lumos.cs.uni-freiburg.de.
|
|
11:20-11:25, Paper WeCT21.2 | |
LIMT: Language-Informed Multi-Task Visual World Models |
|
Aljalbout, Elie | University of Zurich |
Sotirakis, Nikolaos | Technical University of Munich |
van der Smagt, Patrick | Volkswagen Group |
Karl, Maximilian | Foundation Robotics Labs |
Chen, Nutan | Volkswagen Group |
Keywords: Reinforcement Learning, Representation Learning, Machine Learning for Robot Control
Abstract: Most recent successes in robot reinforcement learning involve learning a specialized single-task agent. However, robots capable of performing multiple tasks can be much more valuable in real-world applications. Multi-task reinforcement learning can be very challenging due to the increased sample complexity and the potentially conflicting task objectives. Previous work on this topic is dominated by model-free approaches. The latter can be very sample inefficient even when learning specialized single-task agents. In this work, we focus on model-based multi-task reinforcement learning. We propose a method for learning multi-task visual world models, leveraging pre-trained language models to extract semantically meaningful task representations. These representations are used by the world model and policy to reason about task similarity in dynamics and behavior. Our results highlight the benefits of using language-driven task representations for world models and a clear advantage of model-based multi-task learning over the more common model-free paradigm.
|
|
11:25-11:30, Paper WeCT21.3 | |
Towards Robust Autonomous Driving: Conditional Multimodal Large Language Models for Fine-Grained Perception |
|
Sun, Fengzhao | University of Science and Technology of China |
Yu, Jun | University of Science and Technology of China |
Zhang, Yunxiang | University of Science and Technology of China |
Hou, Jiaming | Harbin Institute of Technology |
Lu, Xilong | University of Science and Technology |
Song, Heng | China Railway No.4 Engineering Group Co., Ltd |
Gao, Fang | Guangxi University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, AI-Based Methods
Abstract: Multimodal large language models (MLLMs) have shown remarkable performance across various visual understanding tasks. However, most existing MLLMs still lack image detail perception, limiting their effectiveness in tasks that require detailed visual information. In this paper, we introduce Percept-DriveLM, a novel MLLM designed to tackle the fine-grained perception challenges in autonomous driving tasks. At the core of our model is the Visual Fusion Module, which integrates several innovative components: a dynamic resolution mechanism that combines both high and low resolution features, and an RoI conditional mechanism to incorporate object/region-level features identified by offline detectors, further refining the model's fine-grained perception abilities. Trained in a two-stage process, our model demonstrates exceptional performance, outperforming existing MLLMs with comparable parameter sizes and excelling in both autonomous driving perception and general vision-language tasks. The effectiveness of our approach is validated through extensive empirical studies. Code will be available at https://github.com/DebuggerSunfz/PerceptDriveLM.
|
|
11:30-11:35, Paper WeCT21.4 | |
Automated Hybrid Reward Scheduling Via Large Language Models for Robotic Skill Learning |
|
Huang, Changxin | Shenzhen University |
Liang, Junyang | Shenzhen University |
Chang, Yanbin | Shenzhen University |
Xu, Jingzhao | Shenzhen University |
Li, Jianqiang | Shenzhen University, |
Keywords: Reinforcement Learning, Machine Learning for Robot Control
Abstract: Enabling a high-degree-of-freedom robot to learn specific skills is a challenging task due to the complexity of robotic dynamics. Reinforcement learning (RL) has emerged as a promising solution; however, addressing such problems requires the design of multiple reward functions to account for various constraints in robotic motion. Existing approaches typically sum all reward components indiscriminately to optimize the RL value function and policy. We argue that this uniform inclusion of all reward components in policy optimization is inefficient and limits the robot’s learning performance. To address this, we propose an Automated Hybrid Reward Scheduling (AHRS) framework based on Large Language Models (LLMs). This paradigm dynamically adjusts the learning intensity of each reward component throughout the policy optimization process, enabling robots to acquire skills in a gradual and structured manner. Specifically, we design a multi-branch value network, where each branch corresponds to a distinct reward component. During policy optimization, each branch is assigned a weight that reflects its importance, and these weights are automatically computed based on rules designed by LLMs. The LLM generates a rule set in advance, derived from the task description, and during training, it selects a weight calculation rule from the library based on language prompts that evaluate the performance of each branch. Experimental results demonstrate that the AHRS method achieves an average 6.48% performance improvement across multiple high-degree-of-freedom robotic tasks.
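A minimal sketch of the hybrid reward scheduling idea: per-branch estimates are combined with weights chosen by a rule, here a hard-coded stand-in for the LLM-selected rule; the branch names and numbers are illustrative only:

```python
import numpy as np

# Hypothetical reward branches and an LLM-authored scheduling rule: early in
# training emphasize balance, later shift weight toward velocity tracking and energy.
BRANCHES = ["balance", "velocity_tracking", "energy"]
RULES = {
    "early": {"balance": 0.6, "velocity_tracking": 0.3, "energy": 0.1},
    "late":  {"balance": 0.2, "velocity_tracking": 0.5, "energy": 0.3},
}

def scheduled_advantage(branch_advantages, progress):
    """Combine per-branch advantage estimates with weights chosen by the
    currently selected rule (here a simple training-progress switch)."""
    rule = RULES["early"] if progress < 0.5 else RULES["late"]
    w = np.array([rule[b] for b in BRANCHES])
    return float(np.dot(w / w.sum(), branch_advantages))

print(scheduled_advantage(np.array([0.8, -0.1, 0.3]), progress=0.2))
print(scheduled_advantage(np.array([0.8, -0.1, 0.3]), progress=0.9))
```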
|
|
11:35-11:40, Paper WeCT21.5 | |
RT-Affordance: Affordances Are Versatile Intermediate Representations for Robot Manipulation |
|
Nasiriany, Soroush | The University of Texas at Austin |
Kirmani, Sean | Google DeepMind |
Ding, Tianli | Google |
Smith, Laura | UC Berkeley |
Zhu, Yuke | The University of Texas at Austin |
Driess, Danny | TU Berlin |
Sadigh, Dorsa | Stanford University |
Xiao, Ted | Google DeepMind |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Deep Learning Methods
Abstract: We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany.me/rt-affordance
|
|
11:40-11:45, Paper WeCT21.6 | |
A Real-To-Sim-To-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards |
|
Patel, Shivansh | University of Illinois Urbana Champaign |
Yin, Xinchen | University of Illinois Urbana Champaign |
Huang, Wenlong | Stanford University |
Garg, Shubham | Amazon |
Nayyeri, Hooshang | Amazon |
Fei-Fei, Li | Stanford University |
Lazebnik, Svetlana | University of Illinois |
Li, Yunzhu | Columbia University |
Keywords: Machine Learning for Robot Control, Sensorimotor Learning, Deep Learning in Grasping and Manipulation
Abstract: Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
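Per the abstract, the rewards IKER generates are plain Python functions over scene keypoints; the toy example below shows the general shape of such a function for a hypothetical placement task (the task, keypoint names, and weights are invented here, not generated by the authors' VLM pipeline):

```python
import numpy as np

def reward(keypoints):
    """Example keypoint-based reward for 'place the mug next to the plate'.

    keypoints: dict of named 3D points tracked in the scene. The structure
    (a plain Python function over keypoints) mirrors the kind of reward the
    paper describes; the specific terms here are illustrative.
    """
    mug, plate = keypoints["mug_handle"], keypoints["plate_center"]
    target = plate + np.array([0.15, 0.0, 0.0])       # 15 cm to the side of the plate
    dist_term = -np.linalg.norm(mug - target)          # move mug toward the target
    height_term = -abs(mug[2] - plate[2])              # keep it at table height
    return dist_term + 0.5 * height_term

print(reward({"mug_handle": np.array([0.40, 0.05, 0.02]),
              "plate_center": np.array([0.30, 0.00, 0.02])}))
```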
|
|
WeCT22 |
411 |
Imitation Learning 2 |
Regular Session |
Chair: Johns, Edward | Imperial College London |
Co-Chair: Kaelbling, Leslie | MIT |
|
11:15-11:20, Paper WeCT22.1 | |
Learning Task Specifications from Demonstrations As Probabilistic Automata |
|
Baert, Mattijs | Ghent University |
Leroux, Sam | Ghent University |
Simoens, Pieter | Ghent University - Imec |
Keywords: Learning from Demonstration, Imitation Learning, Task Planning
Abstract: Specifying tasks for robotic systems traditionally requires coding expertise, deep domain knowledge, and significant time investment. While learning from demonstration offers a promising alternative, existing methods often struggle with tasks of longer horizons. To address this limitation, we introduce a computationally efficient approach for learning probabilistic deterministic finite automata (PDFA) that capture task structures and expert preferences directly from demonstrations. Our approach infers sub-goals and their temporal dependencies, producing an interpretable task specification that domain experts can easily understand and adjust. We validate our method through experiments involving object manipulation tasks, showcasing how our method enables a robot arm to effectively replicate diverse expert strategies while adapting to changing conditions.
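A stripped-down sketch of estimating PDFA transition probabilities by counting symbol transitions in demonstrations; it identifies states with the last emitted sub-goal symbol, which is a simplification of the inference described in the abstract:

```python
from collections import defaultdict

def learn_pdfa(demonstrations):
    """Estimate a probabilistic deterministic finite automaton from demonstrations.

    Each demonstration is a sequence of discrete sub-goal symbols. States are
    identified with the last symbol seen (a simplifying assumption; the paper
    infers sub-goals and temporal dependencies rather than taking them as given).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for demo in demonstrations:
        state = "start"
        for symbol in demo:
            counts[state][symbol] += 1
            state = symbol
    return {s: {sym: c / sum(nxt.values()) for sym, c in nxt.items()}
            for s, nxt in counts.items()}

demos = [["reach", "grasp", "lift", "place"],
         ["reach", "grasp", "place"],
         ["reach", "grasp", "lift", "place"]]
for state, transitions in learn_pdfa(demos).items():
    print(state, transitions)
```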
|
|
11:20-11:25, Paper WeCT22.2 | |
Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments |
|
Etukuru, Haritheja | New York University |
Naka, Norihito | New York University |
Hu, Zijin | New York University |
Lee, Seungjae | Seoul National University |
Mehu, Julian | Hello Robot Inc |
Edsinger, Aaron | Hello Robot |
Paxton, Chris | Meta AI |
Chintala, Soumith | Facebook AI Research |
Pinto, Lerrel | New York University |
Shafiullah, Nur Muhammad (Mahi) | New York University |
Keywords: Imitation Learning, Big Data in Robotics and Automation, Learning from Demonstration
Abstract: Robot models, particularly those trained with large amounts of data, have recently shown a plethora of real-world manipulation and navigation capabilities. Several independent efforts have shown that given sufficient training data in an environment, robot policies can generalize to demonstrated variations in that environment. However, needing to finetune robot models to every new environment stands in stark contrast to models in language or vision that can be deployed zero-shot for open-world problems. In this work, we present Robot Utility Models (RUMs), a framework for training and deploying zero-shot robot policies that can directly generalize to new environments without any finetuning. To create RUMs efficiently, we develop new tools to quickly collect data for mobile manipulation tasks, integrate such data into a policy with multi-modal imitation learning, and deploy policies on-device on Hello Robot Stretch, a cheap commodity robot, with an external mLLM verifier for retrying. We train five such utility models for opening cabinet doors, opening drawers, picking up napkins, picking up paper bags, and reorienting fallen objects. Our system, on average, achieves 90% success rate in unseen, novel environments interacting with unseen objects. Moreover, the utility models can also succeed in different robot and camera set-ups with no further data, training, or fine-tuning. Primary among our lessons are the importance of training data over training algorithm and policy class, guidance about data scaling, necessity for diverse yet high-quality demonstrations, and a recipe for robot introspection and retrying to improve performance on individual environments.
|
|
11:25-11:30, Paper WeCT22.3 | |
R+X: Retrieval and Execution from Everyday Human Videos |
|
Papagiannis, Georgios | Imperial College London |
Di Palo, Norman | Imperial College London |
Vitiello, Pietro | Imperial College London |
Johns, Edward | Imperial College London |
Keywords: Learning from Demonstration, Imitation Learning, Continual Learning
Abstract: We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Appendix and videos are available at https://www.robot-learning.uk/r-plus-x.
|
|
11:30-11:35, Paper WeCT22.4 | |
ARCap: Collecting High-Quality Human Demonstrations for Robot Learning with Augmented Reality Feedback |
|
Chen, Sirui | Stanford University |
Wang, Chen | Stanford University |
Nguyen, Kaden | Stanford University |
Fei-Fei, Li | Stanford University |
Liu, Karen | Stanford University |
Keywords: Imitation Learning, Virtual Reality and Interfaces, Dexterous Manipulation
Abstract: Recent progress in imitation learning from human demonstrations has shown promising results in teaching robots manipulation skills. To further scale up training datasets, recent works start to use portable data collection devices without the need for physical robot hardware. However, due to the absence of on-robot feedback during data collection, the data quality depends heavily on user expertise, and many devices are limited to specific robot embodiments. We propose ARCap, a portable data collection system that provides visual feedback through augmented reality (AR) and haptic warnings to guide users in collecting high-quality demonstrations. Through extensive user studies, we show that ARCap enables novice users to collect robot-executable data that matches robot kinematics and avoids collisions with the scenes. With data collected from ARCap, robots can perform challenging tasks, such as manipulation in cluttered environments and long-horizon cross-embodiment manipulation. ARCap is fully open-source and easy to calibrate; all components are built from off-the-shelf products. More details can be found on our website: https://stanford-tml.github.io/ARCap
|
|
11:35-11:40, Paper WeCT22.5 | |
XMoP: Whole-Body Control Policy for Zero-Shot Cross-Embodiment Neural Motion Planning |
|
Rath, Prabin Kumar | Arizona State University |
Gopalan, Nakul | Arizona State University |
Keywords: Learning from Demonstration, Whole-Body Motion Planning and Control, Collision Avoidance
Abstract: Classical manipulator motion planners work across different robot embodiments. However, they plan on a pre-specified static environment representation and are not scalable to unseen dynamic environments. Neural Motion Planners (NMPs) are an appealing alternative to conventional planners as they incorporate different environmental constraints to learn motion policies directly from raw sensor observations. Contemporary state-of-the-art NMPs can successfully plan across different environments. However, none of the existing NMPs generalize across robot embodiments. In this paper we propose Cross-Embodiment Motion Policy (XMoP), a neural policy for learning to plan over a distribution of manipulators. XMoP implicitly learns to satisfy kinematic constraints for a distribution of robots and zero-shot transfers the planning behavior to unseen robotic manipulators within this distribution. We achieve this generalization by formulating a whole-body control policy that is trained on planning demonstrations from over three million procedurally sampled robotic manipulators in different simulated environments. Despite being completely trained on synthetic embodiments and environments, our policy exhibits strong sim-to-real generalization across manipulators with different kinematic variations and degrees of freedom with a single set of frozen policy parameters. We evaluate XMoP on 7 commercial manipulators and show successful cross-embodiment motion planning, achieving an average 70% success rate on baseline benchmarks. Furthermore, we demonstrate sim-to-real deployment on two unseen manipulators solving novel planning problems across three real-world domains even with dynamic obstacles.
|
|
11:40-11:45, Paper WeCT22.6 | |
KALM: Keypoint Abstraction Using Large Models for Object-Relative Imitation Learning |
|
Fang, Xiaolin | MIT |
Huang, Bo-Ruei | National Taiwan University |
Mao, Jiayuan | MIT |
Shone, Jasmine | MIT |
Tenenbaum, Joshua | Massachusetts Institute of Technology |
Lozano-Perez, Tomas | MIT |
Kaelbling, Leslie | MIT |
Keywords: Learning from Demonstration, Deep Learning in Grasping and Manipulation, Imitation Learning
Abstract: Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifies them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Videos can be found at https://kalm-il.github.io/.
|
|
WeCT23 |
412 |
Autonomous Vehicle Perception 5 |
Regular Session |
Chair: Chun, Il Yong | Sungkyunkwan University |
|
11:15-11:20, Paper WeCT23.1 | |
AutoSplat: Constrained Gaussian Splatting for Autonomous Driving Scene Reconstruction |
|
Khan, Mustafa | University of Toronto |
Fazlali, Hamidreza | Noah's Ark Lab |
Sharma, Dhruv | Huawei Research Canada |
Cao, Tongtong | Noah's Ark Lab, Huawei Technologies |
Bai, Dongfeng | Noah's Ark Lab, Huawei Technologies |
Ren, Yuan | Noah's Ark Lab, Huawei Technologies Canada Inc |
Liu, Bingbing | Huawei Technologies |
Keywords: Autonomous Agents, Simulation and Animation, AI-Enabled Robotics
Abstract: Realistic scene reconstruction and view synthesis are essential for advancing autonomous driving systems by simulating safety-critical scenarios. 3D Gaussian Splatting excels in real-time rendering and static scene reconstructions but struggles with modeling driving scenarios due to complex backgrounds, dynamic objects, and sparse camera views. We propose AutoSplat, a framework employing Gaussian splatting to realistically reconstruct autonomous driving scenes. By imposing geometric constraints on Gaussians representing the road and sky regions, our method enables multi-view consistent simulation of challenging scenarios, including lane changes. Leveraging 3D templates, we introduce a reflected Gaussian consistency constraint to supervise both the visible and unseen side of foreground objects. Moreover, to model the dynamic appearance of foreground objects, we estimate temporally-dependent residual spherical harmonics for each foreground Gaussian. Extensive experiments on Pandaset and KITTI demonstrate that AutoSplat outperforms state-of-the-art methods in scene reconstruction and novel view synthesis across diverse driving scenarios. Our project page can be found here: https://autosplat.github.io/
|
|
11:20-11:25, Paper WeCT23.2 | |
Diffusion-Based Generative Models for 3D Occupancy Prediction in Autonomous Driving |
|
Wang, Yunshen | Beijing University of Posts and Telecommunications |
Liu, Yicheng | Tsinghua University |
Yuan, Tianyuan | Tsinghua University |
Mao, Yucheng | University of Science and Technology Beijing |
Liang, Yingshi | Beijing University of Posts and Telecommunications |
Yang, Xiuyu | Tsinghua University |
Zhang, Honggang | Beijing University of Posts and Telecommunications |
Zhao, Hang | Tsinghua University |
Keywords: Autonomous Agents, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Accurately predicting 3D occupancy grids from visual inputs is critical for autonomous driving, but current discriminative methods struggle with noisy data, incomplete observations, and the complex structures inherent in 3D scenes. In this work, we reframe 3D occupancy prediction as a generative modeling task using diffusion models, which learn the underlying data distribution and incorporate 3D scene priors. This approach enhances prediction consistency, noise robustness, and better handles the intricacies of 3D spatial structures. Our extensive experiments show that diffusion-based generative models outperform state-of-the-art discriminative approaches, delivering more realistic and accurate occupancy predictions, especially in occluded or low-visibility regions. Moreover, the improved predictions significantly benefit downstream planning tasks, highlighting the practical advantages of our method for real-world autonomous driving applications.
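The generative reframing described above can be illustrated with a toy reverse-diffusion loop over an occupancy grid. The snippet below is a minimal sketch, not the authors' implementation: the noise schedule, grid shape, and the placeholder denoiser (which in the paper would be a learned network conditioned on camera features) are all illustrative assumptions.
```python
# Toy sketch of framing occupancy prediction as iterative denoising (DDPM-style).
# The "denoiser" here is a placeholder; in the paper it would be a learned network
# conditioned on camera features. All names, shapes, and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T = 50                                   # number of diffusion steps
betas = np.linspace(1e-4, 0.05, T)       # noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t, cond):
    """Placeholder for a learned network predicting the noise added at step t."""
    return 0.1 * (x_t - cond)            # dummy: pull the noisy grid toward the conditioning

def sample_occupancy(cond, shape=(16, 16, 4)):
    """Reverse diffusion: start from Gaussian noise, iteratively denoise to a grid."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, t, cond)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return (x > 0.0).astype(np.uint8)    # threshold logits into a binary occupancy grid

occ = sample_occupancy(cond=np.zeros((16, 16, 4)))
print(occ.shape, occ.mean())
```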
|
|
11:25-11:30, Paper WeCT23.3 | |
Interactive4D: Interactive 4D LiDAR Segmentation |
|
Fradlin, Ilya | RWTH Aachen |
Zulfikar, Idil Esen | RWTH Aachen |
Yilmaz, Kadir | RWTH Aachen University |
Kontogianni, Theodora | ETH Zurich |
Leibe, Bastian | RWTH Aachen University |
Keywords: Object Detection, Segmentation and Categorization, Human-Robot Collaboration, Deep Learning for Visual Perception
Abstract: Interactive segmentation has an important role in facilitating the annotation process of future LiDAR datasets. Existing approaches sequentially segment individual objects at each LiDAR scan, repeating the process throughout the entire sequence, which is redundant and ineffective. In this work, we propose interactive 4D segmentation, a new paradigm that allows segmenting multiple objects on multiple LiDAR scans simultaneously, and Interactive4D, the first interactive 4D segmentation model that segments multiple objects on superimposed consecutive LiDAR scans in a single iteration by utilizing the sequential nature of LiDAR data. While performing interactive segmentation, our model leverages the entire space-time volume, leading to more efficient segmentation. Operating on the 4D volume, it directly provides consistent instance IDs over time and also simplifies tracking annotations. Moreover, we show that click simulations are crucial for successful model training on LiDAR point clouds. To this end, we design a click simulation strategy that is better suited for the characteristics of LiDAR data. To demonstrate its accuracy and effectiveness, we evaluate Interactive4D on multiple LiDAR datasets, where Interactive4D achieves a new state-of-the-art by a large margin. We publicly release the code and models at https://vision.rwth-aachen.de/Interactive4D.
|
|
11:30-11:35, Paper WeCT23.4 | |
Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms |
|
Lin, Chun-Jung | The University of Adelaide |
Garg, Sourav | University of Adelaide |
Chin, Tat-Jun | The University of Adelaide |
Dayoub, Feras | The University of Adelaide |
Keywords: Semantic Scene Understanding, Deep Learning for Visual Perception, Environment Monitoring and Management
Abstract: We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundational model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences. In order to effectively learn correspondences and mis-correspondences between an image pair for the change detection task, we propose to a) “freeze” the backbone in order to retain the generality of dense foundation features, and b) employ “full-image” cross-attention to better tackle the viewpoint variations between the image pair. We evaluate our approach on two benchmark datasets, VL-CMU-CD and PSCD, along with their viewpoint-varied versions. Our experiments demonstrate significant improvements in F1-score, particularly in scenarios involving geometric changes between image pairs. The results indicate our method’s superior generalization capabilities over existing state-of-the-art approaches, showing robustness against photometric and geometric variations as well as better overall generalization when fine-tuned to adapt to new environments. Detailed ablation studies further validate the contributions of each component in our architecture. Our source code is available at: https://github.com/ChadLin9596/Robust-Scene-Change-Detection.
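To make the "frozen backbone plus full-image cross-attention" recipe concrete, the sketch below shows one cross-attention pass between the patch tokens of an image pair. It is an illustrative stand-in, not the released code: the token grid size, feature dimension, random projections, and single-head formulation are assumptions.
```python
# Minimal numpy sketch of "full-image" cross-attention between two frozen feature maps.
# Shapes, projection matrices, and the softmax temperature are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(feat_a, feat_b, d_k=64):
    """feat_a, feat_b: (H*W, C) patch tokens from a frozen backbone (e.g. DINOv2)."""
    rng = np.random.default_rng(0)
    C = feat_a.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((C, d_k)) / np.sqrt(C) for _ in range(3))
    q, k, v = feat_a @ Wq, feat_b @ Wk, feat_b @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)   # every token of A attends to all of B
    return attn @ v                                   # B's content re-sampled into A's layout

tokens_a = np.random.rand(14 * 14, 384)   # e.g. a 14x14 patch grid with 384-dim features
tokens_b = np.random.rand(14 * 14, 384)
fused = cross_attend(tokens_a, tokens_b)
print(fused.shape)   # (196, 64): per-token correspondence features for a change-detection head
```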
|
|
11:35-11:40, Paper WeCT23.5 | |
LaB-CL: Localized and Balanced Contrastive Learning for Improving Parking Slot Detection |
|
Jeong, U Jin | Sungkyunkwan University |
Roh, Sumin | Sungkyunkwan University |
Chun, Il Yong | Sungkyunkwan University |
Keywords: Object Detection, Segmentation and Categorization, Representation Learning, AI-Based Methods
Abstract: Parking slot detection is an essential technology in autonomous parking systems. In general, the classification problem of parking slot detection consists of two tasks, a task determining whether localized candidates are junctions of parking slots or not, and the other that identifies a shape of detected junctions. Both classification tasks can easily face biased learning toward the majority class, degrading classification performances. Yet, the data imbalance issue has been overlooked in parking slot detection. We propose the first supervised contrastive learning framework for parking slot detection, Localized and Balanced Contrastive Learning for improving parking slot detection (LaB-CL). The proposed LaB-CL framework uses two main approaches. First, we propose to include class prototypes to consider representations from all classes in every mini batch, from the local perspective. Second, we propose a new hard negative sampling scheme that selects local representations with high prediction error. Experiments with the benchmark dataset demonstrate that the proposed LaB-CL framework can outperform existing parking slot detection methods.
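The two ideas named in the abstract, class prototypes in every mini-batch and error-based hard negative sampling, can be sketched as follows. This is a hedged illustration rather than the authors' loss: the temperature, the prototype handling, and the top-fraction rule in hard_negatives are assumptions.
```python
# Sketch of a prototype-augmented supervised contrastive loss plus hard-negative selection.
# Not the authors' code; class counts, temperature, and the sampling rule are assumptions.
import numpy as np

def supcon_with_prototypes(feats, labels, prototypes, temperature=0.1):
    """feats: (N, D) L2-normalized local representations; prototypes: (K, D), one per class."""
    bank = np.concatenate([feats, prototypes], axis=0)            # every batch sees all classes
    bank_labels = np.concatenate([labels, np.arange(prototypes.shape[0])])
    sims = feats @ bank.T / temperature
    loss = 0.0
    for i in range(feats.shape[0]):
        pos = (bank_labels == labels[i])
        pos[i] = False                                            # exclude self-similarity
        log_prob = sims[i] - np.log(np.exp(np.delete(sims[i], i)).sum())
        loss += -log_prob[pos].mean()
    return loss / feats.shape[0]

def hard_negatives(pred_probs, labels, top_frac=0.25):
    """Pick local representations with the highest prediction error as hard negatives."""
    err = 1.0 - pred_probs[np.arange(len(labels)), labels]        # 1 - p(true class)
    k = max(1, int(top_frac * len(labels)))
    return np.argsort(-err)[:k]

rng = np.random.default_rng(0)
f = rng.standard_normal((32, 16)); f /= np.linalg.norm(f, axis=1, keepdims=True)
protos = rng.standard_normal((3, 16)); protos /= np.linalg.norm(protos, axis=1, keepdims=True)
y = rng.integers(0, 3, 32)
p = rng.random((32, 3)); p /= p.sum(axis=1, keepdims=True)
print(supcon_with_prototypes(f, y, protos), hard_negatives(p, y)[:5])
```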
|
|
11:40-11:45, Paper WeCT23.6 | |
LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction Using LiDAR and Camera |
|
Ma, Yukai | Zhejiang University |
Mei, Jianbiao | Zhejiang University |
Yang, Xuemeng | Shanghai Artificial Intelligence Laboratory |
Wen, Licheng | Shanghai AI Laboratory |
Xu, Weihua | Zhejiang University |
Zhang, Jiangning | Zhejiang University |
Zuo, Xingxing | Caltech |
Shi, Botian | Shanghai AI Laboratory |
Liu, Yong | Zhejiang University |
Keywords: AI-Enabled Robotics, Sensor Fusion, Deep Learning for Visual Perception
Abstract: Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system's robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this paper, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes and enhancing SSC performance. Regarding model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules—CMRD, BRD, and PDD. Our approach enhances the performance in radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion, R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIoU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively. The project page is available at https://hr-zju.github.io/LiCROcc/.
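As a rough illustration of distilling fused LiDAR-camera features into a radar branch, the sketch below pulls a "student" BEV feature map toward a "teacher" map with combined MSE and cosine terms. The loss weights, shapes, and the simple per-cell formulation are assumptions; the paper's CMRD, BRD, and PDD modules are considerably more structured.
```python
# Illustrative cross-modal feature distillation on BEV grids: a radar "student" feature map
# is pulled toward fused LiDAR-camera "teacher" features. Weights and shapes are assumptions.
import numpy as np

def distill_loss(student_bev, teacher_bev, w_mse=1.0, w_cos=0.5):
    """student_bev, teacher_bev: (H, W, C) BEV feature maps."""
    mse = np.mean((student_bev - teacher_bev) ** 2)
    s = student_bev.reshape(-1, student_bev.shape[-1])
    t = teacher_bev.reshape(-1, teacher_bev.shape[-1])
    cos = np.sum(s * t, axis=1) / (np.linalg.norm(s, axis=1) * np.linalg.norm(t, axis=1) + 1e-8)
    return w_mse * mse + w_cos * np.mean(1.0 - cos)   # push both magnitude and direction together

rng = np.random.default_rng(0)
print(distill_loss(rng.standard_normal((64, 64, 32)), rng.standard_normal((64, 64, 32))))
```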
|
|
WeDT1 |
302 |
Autonomous Vehicles 1 |
Regular Session |
Co-Chair: Li, Xiaopeng | University of Wisconsin-Madison |
|
15:15-15:20, Paper WeDT1.1 | |
Diverse Controllable Diffusion Policy with Signal Temporal Logic |
|
Meng, Yue | Massachusetts Institute of Technology |
Fan, Chuchu | Massachusetts Institute of Technology |
Keywords: Autonomous Agents, Autonomous Vehicle Navigation, Machine Learning for Robot Control
Abstract: Generating realistic simulations is critical for autonomous system applications such as self-driving and human-robot interactions. However, driving simulators nowadays still have difficulty in generating controllable, diverse, and rule-compliant behaviors for road participants: Rule-based models cannot produce diverse behaviors and require careful tuning, whereas learning-based methods imitate the policy from data but are not designed to follow the rules explicitly. Besides, the real-world datasets are by nature "single-outcome", making it hard for learning methods to generate diverse behaviors. In this paper, we leverage Signal Temporal Logic (STL) and Diffusion Models to learn controllable, diverse, and rule-aware policies. We first calibrate the STL on the real-world data, then generate diverse synthetic data using trajectory optimization, and finally learn the rectified diffusion policy on the augmented dataset. We test on the NuScenes dataset and our approach can achieve the most diverse rule-compliant trajectories compared to other baselines, with a runtime 1/17 that of the second-best approach. In closed-loop testing, our approach reaches the highest diversity, rule satisfaction rate, and the lowest collision rate. Our method can generate varied characteristics conditional on different STL parameters in testing. A case study on human-robot encounter scenarios shows our approach can generate diverse and close-to-oracle trajectories. The annotation tool, augmented dataset, and code are available at https://github.com/mengyuest/pSTL-diffusion-policy.
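A small example of the kind of STL scoring such a pipeline relies on: the robustness of an "always" formula is the worst per-step margin, and a conjunction takes the minimum of its parts. The specific rules (a speed limit and a minimum headway) and the numbers below are illustrative assumptions, not the calibrated formulas from the paper.
```python
# Minimal sketch of evaluating Signal Temporal Logic (STL) robustness for a trajectory.
# The rules (headway >= d_min, speed <= v_max) and the profiles are illustrative only.
import numpy as np

def robustness_always(margins):
    """rho(G phi) = min over time of the per-step margin (positive iff satisfied)."""
    return float(np.min(margins))

def driving_rule_robustness(speed, headway, v_max=15.0, d_min=5.0):
    # Conjunction of two "always" rules: rho(phi1 and phi2) = min(rho(phi1), rho(phi2)).
    return min(robustness_always(v_max - speed), robustness_always(headway - d_min))

t = np.linspace(0.0, 8.0, 80)
speed = 12.0 + 2.0 * np.sin(t)            # ego speed profile [m/s]
headway = 20.0 - 1.0 * t                  # shrinking gap to the lead vehicle [m]
rho = driving_rule_robustness(speed, headway)
print("STL robustness:", rho, "->", "rule-compliant" if rho >= 0 else "violated")
```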
|
|
15:20-15:25, Paper WeDT1.2 | |
Dual-Conditioned Temporal Diffusion Modeling for Driving Scene Generation |
|
Bai, Xiangyu | Northeastern University |
Luo, Yedi | Northeastern University |
Jiang, Le | Northeastern University |
Ostadabbas, Sarah | Northeastern University |
Keywords: Autonomous Vehicle Navigation, Deep Learning for Visual Perception, Visual Learning
Abstract: Diffusion models have proven effective at generating high-quality images from learned distributions, but their application to the temporal domain, especially for driving scenarios, remains underexplored. Our work addresses key challenges in existing simulations, such as limited data quality, diversity, and high costs, by extending diffusion models to generate realistic long driving videos. We introduce the Dual-conditioned Temporal Diffusion Model (DcTDM), an open-source method that incorporates dual conditioning to enforce temporal consistency by guiding frame transitions. Alongside DcTDM, we present DriveSceneDDM, a comprehensive driving video dataset featuring textual scene descriptions, dense depth maps, and canny edge data. We evaluate DcTDM using common video quality metrics, demonstrating its superior performance over other video diffusion models by producing long, temporally consistent driving videos up to 40s, achieving over 25% improvement in consistency and frame quality.
|
|
15:25-15:30, Paper WeDT1.3 | |
RL-OGM-Parking: Lidar OGM-Based Hybrid Reinforcement Learning Planner for Autonomous Parking |
|
Wang, Zhitao | Shanghai Jiao Tong University |
Chen, Zhe | Shanghai Jiao Tong University |
Jiang, Mingyang | Shanghai Jiao Tong University |
Qin, Tong | Shanghai Jiao Tong University |
Yang, Ming | Shanghai Jiao Tong University |
Keywords: Autonomous Vehicle Navigation, Autonomous Agents, Reinforcement Learning
Abstract: Autonomous parking has become a critical application in automatic driving research and development. Parking operations often suffer from limited space and complex environments, requiring accurate perception and precise maneuvering. Traditional rule-based parking algorithms struggle to adapt to diverse and unpredictable conditions, while learning-based algorithms lack consistent and stable performance in various scenarios. Therefore, a hybrid approach that combines the stability of rule-based methods with the generalizability of learning-based methods is necessary. Recently, reinforcement learning (RL) based policies have shown robust capability in planning tasks. However, the simulation-to-reality (sim-to-real) transfer gap seriously hinders real-world deployment. To address these problems, we employ a hybrid policy, consisting of a rule-based Reeds-Shepp (RS) planner and a learning-based reinforcement learning (RL) planner. A real-time LiDAR-based Occupancy Grid Map (OGM) representation is adopted to bridge the sim-to-real gap, allowing the hybrid policy to be applied to real-world systems seamlessly. We conducted extensive experiments both in the simulation environment and real-world scenarios, and the results demonstrate that the proposed method outperforms pure rule-based and learning-based methods. The real-world experiments further validate the feasibility and efficiency of the proposed method.
|
|
15:30-15:35, Paper WeDT1.4 | |
Multi-Task Invariant Representation Imitation Learning for Autonomous Driving |
|
Peng, Jinghan | East China Normal University |
Yu, Xing | East China Normal University |
Wang, Jingwen | East China Normal University |
Tian, Lili | East China Normal University |
Dehui, Du | East China Normal University |
Keywords: Autonomous Vehicle Navigation, Imitation Learning, Representation Learning
Abstract: Imitation learning is a promising approach to acquiring autonomous driving policies by mimicking human driver behaviors. However, a major drawback of existing driving policies derived from imitation learning is their proneness to capturing spurious correlations, owing to the lack of an explicit causal model. Deploying such policies in unpredictable real-world environments poses severe risks, as spurious correlations may result in flawed decisions that compromise safety. To tackle this challenge, we introduce a novel approach called Multi-Task Invariant Representation Imitation Learning (MIRIL). MIRIL combines invariant learning with imitation learning to identify cross-environment invariant causal representations from driving demonstrations in various scenarios. These representations are then fed into multiple downstream branches for multi-task learning, including policy learning, perception prediction, invariant representation learning, and transition dynamics learning. Through the multi-task learning approach, the model not only makes consistent driving decisions across different environments but also perceives the vehicle's surroundings, thereby improving adaptability and robustness in diverse driving conditions. This enables MIRIL to effectively handle a wide range of driving scenarios, ensuring safety and efficiency. Supported by clear metrics, this paper details our comprehensive experimental setup, including datasets, benchmarks, and comparative analyses, underscoring the capability of MIRIL to significantly boost system generalization and excel in decision-making.
|
|
15:35-15:40, Paper WeDT1.5 | |
Occ-LLM: Enhancing Autonomous Driving with Occupancy-Based Large Language Models |
|
Xu, Tianshuo | Hong Kong University of Science and Technology (Guangzhou) |
Lu, Hao | HKUST-GZ |
Yan, Xu | Chinese University of Hong Kong, Shenzhen |
Cai, Yingjie | Huawei |
Liu, Bingbing | Huawei Technologies |
Chen, Yingcong | The University of Science and Technology (Guangzhou) |
Keywords: Autonomous Vehicle Navigation, Big Data in Robotics and Automation, Deep Learning Methods
Abstract: Large Language Models (LLMs) have made substantial advancements in the field of robotics and autonomous driving. This study presents the first Occupancy-based Large Language Model (Occ-LLM), which represents a pioneering effort to integrate LLMs with an occupancy representation. To effectively encode occupancy as input for the LLM and address the category imbalances associated with occupancy, we propose the Motion Separation Variational Autoencoder (MS-VAE). This innovative approach utilizes prior knowledge to distinguish dynamic objects from static scenes before inputting them into a tailored Variational Autoencoder (VAE). This separation enhances the model's capacity to concentrate on dynamic trajectories while effectively reconstructing static scenes. The efficacy of Occ-LLM has been validated across key tasks, including 4D occupancy forecasting, self-ego planning, and occupancy-based scene question answering. Comprehensive evaluations demonstrate that Occ-LLM significantly surpasses existing state-of-the-art methodologies, achieving gains of about 6% in Intersection over Union (IoU) and 4% in mean Intersection over Union (mIoU) for the task of 4D occupancy forecasting. These findings highlight the transformative potential of Occ-LLM in reshaping current paradigms within robotics and autonomous driving.
|
|
15:40-15:45, Paper WeDT1.6 | |
DISC: Dataset for Analyzing Driving Styles in Simulated Crashes for Mixed Autonomy |
|
Senthil Kumar, Sandip Sharan | University of Maryland, College Park |
Thalapanane, Sandeep | University of Maryland, College Park |
Appiya Dilipkumar Peethambari, Guru Nandhan | University of Maryland College Park |
Sri hari, Sourang | University of Maryland College Park |
Zheng, Laura | University of Maryland, College Park |
Lin, Ming C. | University of Maryland at College Park |
Keywords: Autonomous Vehicle Navigation, Virtual Reality and Interfaces, Data Sets for Robot Learning
Abstract: Handling pre-crash scenarios is still a major challenge for self-driving cars due to limited data and human-driving behavior datasets. We introduce DISC, one of the first datasets designed to capture various driving styles and behaviors in pre-crash scenarios for mixed autonomy analysis. DISC includes over 8 classes of driving styles/behaviors from hundreds of drivers navigating a simulated vehicle through a virtual city, encountering rare-event traffic scenarios. This dataset enables the classification of pre-crash human driving behaviors in unsafe conditions, supporting individualized trajectory prediction based on observed driving patterns. It offers the potential to improve autonomous vehicle safety by accounting for diverse human driving behaviors in stressful traffic and rare accident scenarios, which are otherwise difficult or risky to capture. By utilizing a VR-based driving simulator, TRAVERSE, data was collected through a driver-centric study involving human drivers encountering 12 simulated accident scenarios. This dataset fills a critical gap in human-centric driving data for rare events involving interactions with autonomous vehicles. It enables autonomous systems to better react to human drivers and optimize trajectory prediction in mixed autonomy environments involving both human-driven and self-driving cars. It includes essential data such as acceleration, braking, and vehicle pose, providing a foundation for machine-learning models in autonomous vehicles. In addition, individual driving behaviors are classified through a set of standardized questionnaires, carefully designed to identify and categorize driving behavior. We correlate data features with driving behaviors, showing that the simulated environment reflects real-world driving styles. DISC is the first dataset to capture how various driving styles respond to accident scenarios, offering significant potential to enhance autonomous vehicle safety and driving behavior analysis in mixed autonomy environments.
|
|
15:45-15:50, Paper WeDT1.7 | |
Real-World Automated Vehicle Longitudinal Stability Analysis: Controller Design and Field Test |
|
Ma, Ke | University of Wisconsin-Madison |
Zhang, Yuqin | Chang’an University |
Zhou, Hang | University of Wisconsin-Madison |
Liang, Zhaohui | University of Wisconsin Madison |
Li, Xiaopeng | University of Wisconsin-Madison |
Keywords: Autonomous Vehicle Navigation, Integrated Planning and Control, Robust/Adaptive Control
Abstract: Although extensive research has been conducted on modeling the stable longitudinal controller of automated vehicles (AVs) to dampen traffic oscillations, the real-world performance of these controllers in actual vehicles remains uncertain. In the operation of real-world AVs, the delay between actual dynamics and the commands prevents the controller's command from being effectively implemented to dampen traffic oscillations. Thus, this study adapts the designed controllers within an AV test platform to compare the theoretically stable conditions with the actual oscillation dampening performance. Initially, we compute the stable conditions for both the traditional car-following controller, which assumes no delay, and the longitudinal controller that accounts for the dynamic response of the vehicle. Through empirical experiments, we demonstrate that the longitudinal controller predicts vehicle stability more accurately than conventional car-following controller, showing an improvement from an average prediction accuracy rate of 0.59 to 0.91. Also, the experiments uncover specific delays inherent in dynamics systems, with a response delay of 0.34 seconds. Our work makes two principal contributions to the field of AV control systems. First, it empirically validates that the longitudinal model, which accounts for the vehicle's dynamic responses, offers a more precise representation of vehicular behavior. Second, the relatively brief response delay identified expands the stability region, thereby enhancing vehicle control and safety. The longitudinal controller is critical for enhancing AV performance and reliability in dampening traffic oscillations.
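The role of the actuation delay can be illustrated with a toy simulation that applies the same car-following law with and without a delayed command buffer. The gains, the scenario, and the reuse of the 0.34 s figure quoted in the abstract are purely illustrative assumptions, not the paper's controller.
```python
# Toy sketch of why actuation delay matters for longitudinal control: the same car-following
# law is simulated with and without a response delay. All gains and numbers are illustrative.
import numpy as np

def simulate(delay_steps, dt=0.02, T=30.0, kp=0.45, kv=0.25, s_star=20.0):
    n = int(T / dt)
    lead_v = 15.0 + np.where(np.arange(n) * dt > 5.0, -3.0, 0.0)    # lead car brakes at t = 5 s
    x_lead, x_ego, v_ego = 40.0, 0.0, 15.0
    u_buf = [0.0] * delay_steps                                      # commands waiting to take effect
    gaps = []
    for k in range(n):
        gap = x_lead - x_ego
        u_buf.append(kp * (gap - s_star) + kv * (lead_v[k] - v_ego)) # car-following command
        a = u_buf.pop(0)                                             # delayed command actually applied
        v_ego += a * dt
        x_ego += v_ego * dt
        x_lead += lead_v[k] * dt
        gaps.append(gap)
    return np.array(gaps)

print("min gap, no delay    :", simulate(0).min())
print("min gap, 0.34 s delay:", simulate(round(0.34 / 0.02)).min())
```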
|
|
WeDT2 |
301 |
Learning-Based SLAM 1 |
Regular Session |
|
15:15-15:20, Paper WeDT2.1 | |
RoDyn-SLAM: Robust Dynamic Dense RGB-D SLAM with Neural Radiance Fields |
|
Jiang, Haochen | Fudan University |
Xu, Yueming | Fudan University |
Li, Kejie | The University of Oxford |
Feng, Jianfeng | Fudan University |
Zhang, Li | Fudan University |
Keywords: SLAM, Deep Learning Methods, Visual Learning
Abstract: Leveraging neural implicit representation to conduct dense RGB-D SLAM has been studied in recent years. However, this approach relies on a static environment assumption and does not work robustly within a dynamic environment due to the inconsistent observation of geometry and photometry. To address the challenges presented in dynamic environments, we propose a novel dynamic SLAM framework with neural radiance field. Specifically, we introduce a motion mask generation method to filter out the invalid sampled rays. This design effectively fuses the optical flow mask and semantic mask to enhance the precision of motion mask. To further improve the accuracy of pose estimation, we have designed a divide-and-conquer pose optimization algorithm that distinguishes between keyframes and non-keyframes. The proposed edge warp loss can effectively enhance the geometry constraints between adjacent frames. Extensive experiments are conducted on the two challenging datasets, and the results show that RoDyn-SLAM achieves state-of-the-art performance among recent neural RGB-D methods in both accuracy and robustness.
|
|
15:20-15:25, Paper WeDT2.2 | |
HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM |
|
Gong, Ziren | University of Bologna |
Tosi, Fabio | University of Bologna |
Zhang, Youmin | University of Bologna |
Mattoccia, Stefano | University of Bologna |
Poggi, Matteo | University of Bologna |
Keywords: SLAM, Mapping, Localization
Abstract: NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes undergoing significant movement or suffering from forgetting. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drift and mitigate accumulated errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
|
|
15:25-15:30, Paper WeDT2.3 | |
Gassidy: Gaussian Splatting SLAM in Dynamic Environments |
|
Wen, Long | Technical University of Munich |
Li, Shixin | Technical University of Munich |
Zhang, Yu | Technical University of Munich |
Huang, Yuhong | Technische Universität München |
Lin, Jianjie | Technische Universität München |
Pan, Fengjunjie | Technical University of Munich |
Bing, Zhenshan | Technical University of Munich |
Knoll, Alois | Tech. Univ. Muenchen TUM |
Keywords: SLAM, Localization, Mapping
Abstract: 3D Gaussian Splatting (3DGS) allows flexible adjustments to scene representation, enabling continuous optimization of scene quality during dense visual simultaneous localization and mapping (SLAM) in static environments. However, 3DGS faces challenges in handling environmental disturbances from dynamic objects with irregular movement, leading to degradation in both camera tracking accuracy and map reconstruction quality. To address this challenge, we develop an RGB-D dense SLAM which is called Gaussian Splatting SLAM in Dynamic Environments (Gassidy). This approach calculates Gaussians to generate rendering loss flows for each environmental component based on a designed photometric-geometric loss function. To distinguish and filter environmental disturbances, we iteratively analyze rendering loss flows to detect features characterized by changes in loss values between dynamic objects and static components. This process ensures a clean environment for accurate scene reconstruction. Compared to state-of-the-art SLAM methods, experimental results on open datasets show that Gassidy improves camera tracking precision by up to 97.9% and enhances map quality by up to 6%.
|
|
15:30-15:35, Paper WeDT2.4 | |
Large-Scale Gaussian Splatting SLAM |
|
Xin, Zhe | Meituan |
Wu, Chenyang | University of Science and Technology of China |
Huang, Penghui | Meituan |
Zhang, Yanyong | University of Science and Technology of China |
Mao, Yinian | Meituan-Dianping Group |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: SLAM, Mapping, Deep Learning for Visual Perception
Abstract: The recently developed Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown encouraging and impressive results for visual SLAM. However, most representative methods require RGB-D sensors and are only available for indoor environments. The robustness of reconstruction in large-scale outdoor scenarios remains unexplored. This paper introduces a large-scale 3DGS-based visual SLAM with stereo cameras, termed LSG-SLAM. The proposed LSG-SLAM employs a multi-modality strategy to estimate prior poses under large view changes. In tracking, we introduce feature-alignment warping constraints to alleviate the adverse effects of appearance similarity in rendering losses. For the scalability of large-scale scenarios, we introduce continuous Gaussian Splatting submaps to tackle unbounded scenes with limited memory. Loops are detected between GS submaps by place recognition and the relative pose between looped keyframes is optimized utilizing rendering and feature warping losses. After the global optimization of camera poses and Gaussian points, a structure refinement module enhances the reconstruction quality. With extensive evaluations on the EuRoC and KITTI datasets, LSG-SLAM achieves superior performance over existing Neural, 3DGS-based, and even traditional approaches.
|
|
15:35-15:40, Paper WeDT2.5 | |
OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding |
|
Yang, Dianyi | Beijing Institute of Technology |
Gao, Yu | Beijing Institude of Technology |
Wang, Xihan | Beijing Institute of Technology |
Yue, Yufeng | Beijing Institute of Technology |
Yang, Yi | Beijing Institute of Technology |
Fu, Mengyin | Beijing Institute of Technology |
Keywords: SLAM, Semantic Scene Understanding, RGB-D Perception
Abstract: Recent advancements in 3D Gaussian Splatting have significantly improved the efficiency and quality of dense semantic SLAM. However, previous methods are generally constrained by limited-category pre-trained classifiers and implicit semantic representation, which hinder their performance in open-set scenarios and restrict 3D object-level scene understanding. To address these issues, we propose OpenGS-SLAM, an innovative framework that utilizes 3D Gaussian representation to perform dense semantic SLAM in open-set environments. Our system integrates explicit semantic labels derived from 2D foundational models into the 3D Gaussian framework, facilitating robust 3D object-level scene understanding. We introduce Gaussian Voting Splatting to enable fast 2D label map rendering and scene updating. Additionally, we propose a Confidence-based 2D Label Consensus method to ensure consistent labeling across multiple views. Furthermore, we employ a Segmentation Counter Pruning strategy to improve the accuracy of semantic scene representation. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method in scene understanding, tracking, and mapping, achieving 10× faster semantic rendering and 2× lower storage costs compared to existing methods.
|
|
15:40-15:45, Paper WeDT2.6 | |
SAP-SLAM: Semantic-Assisted Perception SLAM with 3D Gaussian Splatting |
|
Yang, Yuheng | Shenzhen International Graduate School, Tsinghua University |
Lin, Yudong | Tsinghua University |
Yang, Wenming | Tsinghua University |
Wang, Guijin | Tsinghua University |
Liao, Qingmin | Tsinghua University |
Keywords: SLAM, RGB-D Perception, Mapping
Abstract: The integration of 3D Gaussians has introduced a novel scene representation in Simultaneous Localization and Mapping (SLAM), characterized by explicit representation and differentiable rendering capabilities that enhance scene reconstruction and understanding. However, most current SLAM systems only exploit the basic representational capacity of 3D Gaussians, neglecting their potential to offer richer information and facilitate higher-dimensional scene comprehension. Furthermore, these systems often struggle with reconstruction when encountering rapid camera movements or missing depth. Drawing inspiration from 3D language fields, which explore the intrinsic relationships among scene objects, we propose SAP-SLAM, a dense SLAM system that combines robust tracking, high-fidelity reconstruction, and advanced semantic understanding. Our approach leverages pre-trained visual models to extract semantic features, which are then fused, dimensionally reduced, and encoded into the 3D Gaussian model for optimization and rendering. The integration of these features improves the system's semantic comprehension and scene representation, ultimately enabling the creation of high-precision 3D semantic maps. Additionally, we introduce a semantic-guided Gaussian densification and pruning strategy, which uses semantic consistency to prioritize attention on poorly reconstructed areas, greatly improving performance in complex scenarios. SAP-SLAM achieves competitive results on both synthetic and real-world datasets, demonstrating superior capabilities in semantic understanding and reconstruction.
|
|
15:45-15:50, Paper WeDT2.7 | |
Gaussian-LIC: Real-Time Photo-Realistic SLAM with Gaussian Splatting and LiDAR-Inertial-Camera Fusion |
|
Lang, Xiaolei | Zhejiang University |
Li, Laijian | Zhejiang University |
Wu, Chenming | Baidu Research |
Zhao, Chen | Baidu Inc |
Liu, Lina | Zhejiang University |
Liu, Yong | Zhejiang University |
Lv, Jiajun | Zhejiang University |
Zuo, Xingxing | Caltech |
Keywords: Mapping, Sensor Fusion, SLAM
Abstract: In this paper, we present a real-time photo-realistic SLAM method based on marrying Gaussian Splatting with LiDAR-Inertial-Camera SLAM. Most existing radiance-field-based SLAM systems mainly focus on bounded indoor environments, equipped with RGB-D or RGB sensors. However, they are prone to decline when expanding to unbounded scenes or encountering adverse conditions, such as violent motions and changing illumination. In contrast, oriented to general scenarios, our approach additionally tightly fuses LiDAR, IMU, and camera for robust pose estimation and photo-realistic online mapping. To compensate for regions unobserved by the LiDAR, we propose to integrate both the triangulated visual points from images and LiDAR points for initializing 3D Gaussians. In addition, the modeling of the sky and varying camera exposure have been realized for high-quality rendering. Notably, we implement our system purely with C++ and CUDA, and meticulously design a series of strategies to accelerate the online optimization of the Gaussian-based scene representation. Extensive experiments demonstrate that our method outperforms its counterparts while maintaining real-time capability. Impressively, regarding photo-realistic mapping, our method with our estimated poses even surpasses all the compared approaches that utilize privileged ground-truth poses for mapping. Our code will be released on project page https://xingxingzuo.github.io/gaussian_lic.
|
|
WeDT3 |
303 |
Planning for Autonomous Racing |
Regular Session |
Chair: Miao, Fei | University of Connecticut |
Co-Chair: Laine, Forrest | Vanderbilt University |
|
15:15-15:20, Paper WeDT3.1 | |
Risk-Averse Model Predictive Control for Racing in Adverse Conditions |
|
Lew, Thomas | Toyota Research Institute |
Greiff, Marcus | Toyota Research Institute |
Djeumou, Franck | University of Texas, Austin |
Suminaka, Makoto | Toyota Research Institute |
Thompson, Michael | Toyota Research Institute |
Subosits, John | Toyota Research Institute |
Keywords: Planning under Uncertainty, Optimization and Optimal Control, Robot Safety
Abstract: Model predictive control (MPC) algorithms can be sensitive to model mismatch when used in challenging nonlinear control tasks. In particular, the performance of MPC for vehicle control at the limits of handling suffers when the underlying model overestimates the vehicle’s performance capability. In this work, we propose a risk-averse MPC framework that explicitly accounts for uncertainty over friction limits and tire parameters. Our approach leverages a sample-based approximation of an optimal control problem with a conditional value at risk (CVaR) constraint. This sample-based formulation enables planning with a set of expressive vehicle dynamics models using different tire parameters. Moreover, this formulation enables efficient numerical resolution via sequential quadratic programming and GPU parallelization. Experiments on a Lexus LC 500 show that risk-averse MPC unlocks reliable performance, while a deterministic baseline that plans using a single dynamics model may lose control of the vehicle in adverse road conditions.
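A minimal sketch of the sample-based CVaR idea, assuming a batch of constraint residuals (one per sampled friction/tire model, with g <= 0 meaning safe): CVaR at level alpha is the mean of the worst (1 - alpha) tail, which a risk-averse planner constrains instead of the mean. The numbers and the residual distribution below are made up for illustration.
```python
# Sample-based conditional value at risk (CVaR) over constraint residuals.
# The residual distribution is synthetic; only the CVaR mechanics are being illustrated.
import numpy as np

def cvar(samples, alpha=0.9):
    """Conditional value at risk: expected value of the worst (1 - alpha) tail."""
    var = np.quantile(samples, alpha)                 # value at risk (alpha-quantile)
    tail = samples[samples >= var]
    return float(tail.mean())

rng = np.random.default_rng(0)
# Constraint residuals, e.g. friction-limit violation, one per sampled dynamics model.
g = rng.normal(loc=-0.2, scale=0.15, size=256)
print("mean residual    :", g.mean())                 # a mean-based planner may look safe
print("CVaR_0.9 residual:", cvar(g, 0.9))             # a risk-averse planner keeps this <= 0
```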
|
|
15:20-15:25, Paper WeDT3.2 | |
Kineto-Dynamical Planning and Accurate Execution of Minimum-Time Maneuvers on Three-Dimensional Circuits |
|
Piccinini, Mattia | Technical University of Munich |
Taddei, Sebastiano | University of Trento, Politecnico Di Bari |
Betz, Johannes | Technical University of Munich |
Biral, Francesco | University of Trento |
Keywords: Motion and Path Planning, Autonomous Vehicle Navigation, Optimization and Optimal Control
Abstract: Online planning and execution of minimum-time maneuvers on three-dimensional (3D) circuits is an open challenge in autonomous vehicle racing. In this paper, we present an artificial race driver (ARD) to learn the vehicle dynamics, plan and execute minimum-time maneuvers on a 3D track. ARD integrates a novel kineto-dynamical (KD) vehicle model for trajectory planning with economic nonlinear model predictive control (E-NMPC). We use a high-fidelity vehicle simulator (VS) to compare the closed-loop ARD results with a minimum-lap-time optimal control problem (MLT-VS), solved offline with the same VS. Our ARD sets lap times close to the MLT-VS, and the new KD model outperforms a literature benchmark. Finally, we study the vehicle trajectories, to assess the re-planning capabilities of ARD under execution errors. A video with the main results is available as supplementary material.
|
|
15:25-15:30, Paper WeDT3.3 | |
Safety Guaranteed Robust Multi-Agent Reinforcement Learning with Hierarchical Control for Connected and Automated Vehicles |
|
Zhang, Zhili | University of Connecticut |
Ahmad, H M Sabbir | Boston University |
Sabouni, Ehsan | Boston University |
Sun, Yanchao | JPMorgan Chase |
Huang, Furong | University of Maryland |
Li, Wenchao | Boston University |
Miao, Fei | University of Connecticut |
Keywords: Integrated Planning and Control, Reinforcement Learning, Planning under Uncertainty
Abstract: We address the problem of coordination and control of Connected and Automated Vehicles (CAVs) in the presence of imperfect observations in mixed traffic environments. A commonly used approach is learning-based decision-making, such as reinforcement learning (RL). However, most existing safe RL methods suffer from two limitations: (i) they assume accurate state information, and (ii) safety is generally defined over the expectation of the trajectories. It remains challenging to design optimal coordination among multiple agents while ensuring hard safety constraints under system state uncertainties (e.g., those that arise from noisy sensor measurements, communication, or state estimation methods) at every time step. We propose a safety-guaranteed hierarchical coordination and control scheme called Safe-RMM to address the challenge. Specifically, the high-level coordination policy of CAVs in a mixed traffic environment is trained by the Robust Multi-Agent Proximal Policy Optimization (RMAPPO) method. Though trained without uncertainty, our method leverages a worst-case Q network to ensure the model's robust performance when state uncertainties are present during testing. The low-level controller is implemented using model predictive control (MPC) with robust Control Barrier Functions (CBFs) to guarantee safety through their forward invariance property. We compare our method with baselines in different road networks in the CARLA simulator. Results show that our method provides the best evaluated safety and efficiency in challenging mixed traffic environments with uncertainties.
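To illustrate the forward-invariance property mentioned for the low-level CBF controller, the sketch below derives a speed cap from the discrete-time condition h_next >= (1 - gamma) h with barrier h = gap - d_min, and clips a high-level command to it. The kinematic model, gamma, and numbers are assumptions; the paper uses MPC with robust CBFs rather than this one-line filter.
```python
# Hedged sketch of a discrete-time control barrier function (CBF) acting as a safety filter.
# Barrier h = gap - d_min; requiring h_next >= (1 - gamma) * h yields an upper bound on speed.

def cbf_speed_limit(gap, v_lead, d_min=5.0, gamma=0.3, dt=0.1):
    """Max ego speed so that h_next = gap + (v_lead - v)*dt - d_min >= (1 - gamma)*(gap - d_min)."""
    return v_lead + gamma * (gap - d_min) / dt

def cbf_filter(v_cmd, gap, v_lead, **kw):
    """Clip the high-level (e.g. RL) velocity command to the CBF-safe range."""
    return min(v_cmd, cbf_speed_limit(gap, v_lead, **kw))

# An RL policy asks for 14 m/s while only 6 m behind a 10 m/s leader: the CBF caps it.
print(cbf_filter(v_cmd=14.0, gap=6.0, v_lead=10.0))   # -> 13.0 m/s
```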
|
|
15:30-15:35, Paper WeDT3.4 | |
Does Bilevel Optimization Result in More Competitive Racing Behavior? |
|
Cinar, Andrew | Vanderbilt University |
Laine, Forrest | Vanderbilt University |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: Two-vehicle racing is a natural example of a competitive dynamic game. As with most dynamic games, there are many ways in which the underlying solution concept can be structured, resulting in different equilibrium concepts. The assumed solution concept influences the behaviors of two interacting players in racing. For example, blocking behavior emerges naturally in leader-follower play, but to achieve this in Nash play the costs would have to be chosen specifically to trigger this behavior. In this work, we develop a novel model for competitive two-player vehicle racing, represented as an equilibrium problem, complete with simplified aerodynamic drag and drafting effects, as well as position-dependent collision-avoidance responsibility. We use our model to explore how different solution concepts affect competitiveness. We develop a solution method for bilevel optimization problems, enabling a large-scale empirical study comparing bilevel strategies (either as leader or follower), a Nash equilibrium strategy, and a single-player constant-velocity baseline. We find that the choice of strategy significantly affects competitive performance and safety.
|
|
15:35-15:40, Paper WeDT3.5 | |
Gate-Aware Online Planning for Two-Player Autonomous Drone Racing |
|
Zhao, Fangguo | Zhejiang University |
Mei, Jiahao | Zhejiang University of Technology |
Zhou, Jin | Zhejiang University |
Chen, Yuanyi | Zhejiang University |
Chen, Jiming | Zhejiang University |
Li, Shuo | Zhejiang University |
Keywords: Motion and Path Planning, Aerial Systems: Mechanics and Control
Abstract: The flying speed of autonomous quadrotors has increased significantly in the field of autonomous drone racing. However, most research primarily focuses on the aggressive flight of a single quadrotor, simplifying the racing gate traversal problem to a waypoint passing problem that neglects the orientations of the racing gates or implicitly considers the waypoint direction during path planning. In this paper, we propose a systematic method called Pairwise Model Predictive Control (PMPC) that can guide two quadrotors online to navigate racing gates with minimal time and without collisions. The flight task is initially simplified as a point-mass model waypoint passing problem to provide time optimal reference through an efficient two-step velocity search method. Subsequently, we utilize the spatial configuration of the racing track to compute the optimal heading at each gate, maximizing the visibility of subsequent gates for the quadrotors. To address varying gate orientations, we introduce a novel Magnetic Induction Line-based spatial curve to guide the quadrotors through racing gates of different orientations. Furthermore, we formulate a nonlinear optimization problem that uses the point-mass trajectory as initial values and references to enhance solving efficiency. The feasibility of the proposed method is validated through both simulation and real-world experiments. In real-world tests, the two quadrotors achieved a top speed of 6.1 m/s on a 7-waypoint racing track within a compact flying arena of 5 m × 4 m × 2 m.
|
|
15:40-15:45, Paper WeDT3.6 | |
TC-Driver: A Trajectory Conditioned Reinforcement Learning Approach to Zero-Shot Autonomous Racing (I) |
|
Ghignone, Edoardo | ETH |
Baumann, Nicolas | ETH |
Magno, Michele | ETH Zurich |
Keywords: Reinforcement Learning, Wheeled Robots, Deep Learning Methods
Abstract: Autonomous racing challenges perception, planning, and control algorithms, serving as a testbed for general autonomous driving. While traditional methods like MPC can generate optimal control sequences, they are sensitive to modeling parameter accuracy. This paper introduces TC-Driver, a Reinforcement Learning (RL) approach for robust control in autonomous racing, addressing tire parameter modeling inaccuracies. TC-Driver is conditioned by a trajectory from any high-level planner, combining RL’s learning capabilities with the reliability of traditional planning. Trained under varying tire conditions, it aims to generalize across different model parameters, enhancing real-world racing performance. Experimental results show TC-Driver improves generalization robustness compared to a state-of-the-art end-to-end architecture. It achieves a 29-fold improvement in crash ratio when facing model mismatch and successfully transfers to unseen tracks with new features, while the baseline fails. In physical deployment, TC-Driver demonstrates zero-shot Sim2Real capabilities, outperforming end-to-end agents 10-fold in crash ratio while maintaining similar driving characteristics in reality as in simulation. This hybrid RL architecture leverages traditional planning methods’ reliability while exploiting RL’s ability to handle model uncertainties, offering a robust solution for autonomous racing challenges.
|
|
15:45-15:50, Paper WeDT3.7 | |
Er.autopilot 1.1: A Software Stack for Autonomous Racing on Oval and Road Course Tracks (I) |
|
Raji, Ayoub | University of Modena and Reggio Emilia |
Caporale, Danilo | Centro Di Ricerca E. Piaggio |
Gatti, Francesco | Hipert Srl |
Toschi, Alessandro | University of Modena and Reggio Emilia |
Musiu, Nicola | University of Modena and Reggio Emilia |
Verucchi, Micaela | University of Modena and Reggio Emilia |
Prignoli, Francesco | University of Modena and Reggio Emilia |
Malatesta, Davide | Technology Innovation Institute -Autonomous Robotics Research Ce |
Jesus, André Fialho | Technology Innovation Institute - Autonomous Robotics Research C |
Finazzi, Andrea | Korea Advanced Institute of Science and Technology |
Amerotti, Francesco | Università Di Pisa |
Bagni, Fabio | Hipert Srl |
Mascaro, Eugenio | University of Modena and Reggio Emilia |
Musso, Pietro | University of Modena and Reggio Emilia |
Bertogna, Marko | Unimore |
Keywords: Software Architecture for Robotic and Automation, Motion and Path Planning, Sensor Fusion
Abstract: In its first two seasons, the Indy Autonomous Challenge (IAC) organized a series of autonomous racing events across some of the most renowned oval racetracks, introducing various challenges including high-speed solo runs, static obstacle avoidance, and complex head-to-head passing competitions. In 2023, the challenge expanded to include a time-trial event on the iconic F1 Monza road course. This paper outlines the complete software architecture utilized by team TII Unimore Racing (formerly TII EuroRacing), er.autopilot 1.1, encompassing all modules necessary for static obstacle avoidance, active overtakes, achieving speeds over 75 m/s (270 km/h), and navigating complex road course tracks. Building on the previous version, this updated stack integrates new features such as LiDAR-based localization, lateral velocity estimation, a radar-based local controller for safe pull-overs, and refined vehicle modeling for the Model Predictive Controller. We present the overall results along with insights and lessons learned from the first two seasons, during which the team consistently achieved the podium.
|
|
WeDT4 |
304 |
Sensor Fusion 3 |
Regular Session |
Chair: Chen, Boyuan | Duke University |
|
15:15-15:20, Paper WeDT4.1 | |
FlatFusion: Delving into Details of Sparse Transformer-Based Camera-LiDAR Fusion for Autonomous Driving |
|
Zhu, Yutao | Shanghai Jiao Tong University |
Jia, Xiaosong | University of California, Berkeley |
Yang, Xinyu | Carnegie Mellon University |
Yan, Junchi | Shanghai Jiao Tong University |
Keywords: Autonomous Vehicle Navigation, Sensor Fusion, Object Detection, Segmentation and Categorization
Abstract: The integration of data from various sensor modalities (e.g. camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, since image patches are dense in pixel space with ambiguous depth, it necessitates additional design considerations for effective fusion. In this paper, we conduct a comprehensive exploration of design choices for transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single modal tokenizer, and micro-structure of Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set with 10.1 FPS with PyTorch.
|
|
15:20-15:25, Paper WeDT4.2 | |
A2DO: Adaptive Anti-Degradation Odometry with Deep Multi-Sensor Fusion for Autonomous Navigation |
|
Lai, Hui | Fudan University |
Chen, Qi | Fudan University |
Zhang, Junping | Fudan University |
Pu, Jian | Fudan University |
Keywords: Sensor Fusion, Localization, SLAM
Abstract: Accurate localization is essential for the safe and effective navigation of autonomous vehicles, and Simultaneous Localization and Mapping (SLAM) is a cornerstone technology in this context. However, the performance of the SLAM system can deteriorate under challenging conditions such as low light, adverse weather, or obstructions due to sensor degradation. We present A2DO, a novel end-to-end multi-sensor fusion odometry system that enhances robustness in these scenarios through deep neural networks. A2DO integrates LiDAR and visual data, employing a multi-layer, multi-scale feature encoding module augmented by an attention mechanism to mitigate sensor degradation dynamically. The system is pre-trained extensively on simulated datasets covering a broad range of degradation scenarios and fine-tuned on a curated set of real-world data, ensuring robust adaptation to complex scenarios. Our experiments demonstrate that A2DO maintains superior localization accuracy and robustness across various degradation conditions, showcasing its potential for practical implementation in autonomous vehicle systems.
|
|
15:25-15:30, Paper WeDT4.3 | |
Tunable Virtual IMU Frame by Weighted Averaging of Multiple Non-Collocated IMUs |
|
Gao, Yizhou | University of Toronto |
Barfoot, Timothy | University of Toronto |
Keywords: Sensor Fusion, Visual-Inertial SLAM, Localization
Abstract: We present a new method to combine several rigidly connected but physically separated IMUs through a weighted average into a single virtual IMU (VIMU). This has the benefits of (i) reducing process noise through averaging, and (ii) allowing for tuning the location of the VIMU. The VIMU can be placed to be coincident with, for example, a camera frame or GNSS frame, thereby offering a quality-of-life improvement for users. Specifically, our VIMU removes the need to consider any lever-arm terms in the propagation model. We also present a quadratic programming method for selecting the weights to minimize the noise of the VIMU while still selecting the placement of its reference frame. We tested our method in simulation and validated it on a real dataset. The results show that our averaging technique works for IMUs with large separation and performance gain is observed in both the simulation and the real experiment compared to using only a single IMU.
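The variance-reduction half of the approach can be sketched with the classic minimum-variance weighted average: minimizing w^T diag(var) w subject to the weights summing to one yields inverse-variance weights. The placement constraints that tie the weights to the IMU lever arms (the quadratic-programming part of the paper) are omitted here, and the noise values are illustrative.
```python
# Small sketch of the variance-reduction idea behind a virtual IMU: averaging N gyro
# measurements with weights that minimize combined noise. Numbers are illustrative, and the
# frame-placement constraints of the paper's quadratic program are deliberately left out.
import numpy as np

def min_variance_weights(variances):
    """argmin_w w^T diag(var) w  s.t.  sum(w) = 1   ->   w_i proportional to 1 / var_i."""
    inv = 1.0 / np.asarray(variances)
    return inv / inv.sum()

gyro_var = np.array([2.5e-5, 4.0e-5, 9.0e-5])          # per-IMU gyro noise variances [(rad/s)^2]
w = min_variance_weights(gyro_var)
fused_var = float(w @ (np.diag(gyro_var) @ w))          # variance of the weighted average
print("weights:", np.round(w, 3), "fused variance:", fused_var, "best single IMU:", gyro_var.min())
```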
|
|
15:30-15:35, Paper WeDT4.4 | |
WildFusion: Multimodal Implicit 3D Reconstructions in the Wild |
|
Liu, Yanbaihui | Duke University |
Chen, Boyuan | Duke University |
Keywords: Sensor Fusion, Mapping, Field Robots
Abstract: We propose WildFusion, a novel approach for 3D scene reconstruction in unstructured, in-the-wild environments using multimodal implicit neural representations. WildFusion integrates signals from LiDAR, RGB camera, contact microphones, tactile sensors, and IMU. This multimodal fusion generates comprehensive, continuous environmental representations, including pixel-level geometry, color, semantics, and traversability. Through real-world experiments on legged robot navigation in challenging forest environments, WildFusion demonstrates improved route selection by accurately predicting traversability. Our results highlight its potential to advance robotic navigation and 3D mapping in complex outdoor terrains.
|
|
15:35-15:40, Paper WeDT4.5 | |
Steering Prediction Via a Multi-Sensor System for Autonomous Racing |
|
Zhou, Zhuyun | University of Burgundy (Université De Bourgogne), France |
Wu, Zongwei | University of Wurzburg |
Bolli, Florian | University of Zurich |
Boutteau, Rémi | Université De Rouen Normandie |
Yang, Fan | Univ. Bourgogne Franche-Comté |
Timofte, Radu | University of Wurzburg |
Ginhac, Dominique | Univ Burgundy |
Delbruck, Tobi | Univ. of Zurich & ETH Zurich |
Keywords: Sensor Fusion, Deep Learning for Visual Perception, Intelligent Transportation Systems
Abstract: Autonomous racing has rapidly gained research attention. Traditionally, racing cars rely on 2D LiDAR as their primary visual system. In this work, we explore the integration of an event camera with the existing system to provide enhanced temporal information. Our goal is to fuse the 2D LiDAR data with event data in an end-to-end learning framework for steering prediction, which is crucial for autonomous racing. To the best of our knowledge, this is the first study addressing this challenging research topic. We start by creating a multisensor dataset specifically for steering prediction. Using this dataset, we establish a benchmark by evaluating various SOTA fusion methods. Our observations reveal that existing methods often incur substantial computational costs. To address this, we apply low-rank techniques to propose a novel, efficient, and effective fusion design. We introduce a new fusion learning policy to guide the fusion process, enhancing robustness against misalignment. Our fusion architecture provides better steering prediction than LiDAR alone, significantly reducing the RMSE from 7.72 to 1.28. Compared to the second-best fusion method, our work represents only 11% of the learnable parameters while achieving better accuracy. The source code, dataset, and benchmark will be released to promote future research.
|
|
15:40-15:45, Paper WeDT4.6 | |
Are Doppler Velocity Measurements Useful for Spinning Radar Odometry? |
|
Lisus, Daniil | University of Toronto |
Burnett, Keenan | University of Toronto |
Yoon, David Juny | University of Toronto |
Poulton, Richard | Navtech Radar |
Marshall, John | Navtech Radar |
Barfoot, Timothy | University of Toronto |
Keywords: Autonomous Vehicle Navigation, Sensor Fusion, Range Sensing
Abstract: Spinning, frequency-modulated continuous-wave (FMCW) radars with 360 degree coverage have been gaining popularity for autonomous-vehicle navigation. However, unlike 'fixed' automotive radar, commercially available spinning radar systems typically do not produce radial velocities due to the lack of repeated measurements in the same direction and the fundamental hardware setup. To make these radial velocities observable, we modified the firmware of a commercial spinning radar to use triangular frequency modulation. In this paper, we develop a novel way to use this modulation to extract radial Doppler velocity measurements from consecutive azimuths of a radar intensity scan, without any data association. We show that these noisy, error-prone measurements contain enough information to provide good ego-velocity estimates, and incorporate these estimates into different modern odometry pipelines. We extensively evaluate the pipelines on over 110 km of driving data in progressively more geometrically challenging autonomous-driving environments. We show that Doppler velocity measurements improve odometry in well-defined geometric conditions and enable it to continue functioning even in severely geometrically degenerate environments, such as long tunnels.
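As a rough illustration of how per-azimuth Doppler returns constrain ego-motion, the sketch below fits a 2D ego-velocity to radial speeds by linear least squares; the sensor model v_r = -d · v and the synthetic data are assumptions, and a robust loss would be needed in practice to reject returns from moving objects.

```python
import numpy as np

def ego_velocity_from_doppler(azimuths, radial_speeds):
    """Least-squares 2D ego-velocity from per-azimuth Doppler returns.

    Assumes each static return satisfies v_r = -[cos(a), sin(a)] . v_ego
    (sign convention is an assumption); in practice dynamic points would be
    down-weighted with a robust loss.
    """
    d = np.stack([np.cos(azimuths), np.sin(azimuths)], axis=1)   # unit bearing vectors
    v_ego, *_ = np.linalg.lstsq(-d, radial_speeds, rcond=None)
    return v_ego

# Synthetic check: vehicle moving at 5.0 m/s forward and 0.5 m/s laterally.
rng = np.random.default_rng(0)
az = rng.uniform(0.0, 2.0 * np.pi, 400)
true_v = np.array([5.0, 0.5])
vr = -np.stack([np.cos(az), np.sin(az)], axis=1) @ true_v + rng.normal(0.0, 0.2, 400)
print(ego_velocity_from_doppler(az, vr))   # approximately [5.0, 0.5]
```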
|
|
WeDT5 |
305 |
Aerial Robots: Mechanics and Control 2 |
Regular Session |
Chair: Khorrami, Farshad | New York University Tandon School of Engineering |
Co-Chair: Garcia de Marina, Hector | Universidad De Granada |
|
15:15-15:20, Paper WeDT5.1 | |
Skater: A Novel Bi-Modal Bi-Copter Robot for Adaptive Locomotion in Air and Diverse Terrain |
|
Lin, Junxiao | Zhejiang University |
Zhang, Ruibin | Zhejiang University |
Pan, Neng | Zhejiang University |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Keywords: Aerial Systems: Applications, Aerial Systems: Mechanics and Control, Motion Control
Abstract: In this letter, we present a novel bi-modal bi-copter robot called Skater, which is adaptable to air and various ground surfaces. Skater consists of a bi-copter moving along its longitudinal direction with two passive wheels on both sides. Using a longitudinally arranged bi-copter as the unified actuation system for both aerial and ground modes, the robot not only retains a compact and lightweight mechanism but also possesses exceptional terrain-traversing capability and strong steering capacity. Moreover, leveraging the vectored thrust characteristic of bi-copters, the Skater can actively generate the centripetal force needed for steering, enabling it to achieve stable movement even on slippery surfaces. Furthermore, we model the comprehensive dynamics of the Skater, analyze its differential flatness, and introduce a controller using nonlinear model predictive control for trajectory tracking. The outstanding performance of the system is verified by extensive real-world experiments and benchmark comparisons.
|
|
15:20-15:25, Paper WeDT5.2 | |
Inverse Kinematics on Guiding Vector Fields for Robot Path Following |
|
Zhou, Yu | INRIA |
Bautista, Jesús | Universidad De Granada |
Yao, Weijia | Hunan University |
Garcia de Marina, Hector | Universidad De Granada |
Keywords: Aerial Systems: Mechanics and Control, Motion Control, Autonomous Vehicle Navigation
Abstract: Inverse kinematics is a fundamental technique for motion and positioning control in robotics, typically applied to end-effectors. In this paper, we extend the concept of inverse kinematics to guiding vector fields for path following in autonomous mobile robots. The desired path is defined by its implicit equation, i.e., by a collection of points belonging to one or more zero-level sets. These level sets serve as a reference to construct an error signal that drives the guiding vector field toward the desired path, enabling the robot to converge and travel along the path by following such a vector field. We start with the formal exposition on how inverse kinematics can be applied to guiding vector fields for single-integrator robots in an m-dimensional Euclidean space. Then, we leverage inverse kinematics to ensure that the level-set error signal behaves as a linear system, facilitating control over the robot's transient motion toward the desired path and allowing for the injection of feed-forward signals to induce precise motion behavior along the path. We then propose solutions to the theoretical and practical challenges of applying this technique to unicycles with constant speeds to follow 2D paths with precise transient control. We finish by validating the predicted theoretical results through real flights with fixed-wing drones.
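For context, the snippet below shows the classic guiding-vector-field construction for a 2D implicit path (a circle), the kind of field the paper's inverse-kinematics formulation builds upon. It is a textbook sketch with assumed gains, not the authors' method.

```python
import numpy as np

def gvf_circle(p, radius=1.0, k=1.0):
    """Classic 2D guiding vector field for the circle phi(p) = ||p||^2 - r^2 = 0:
    chi = tau - k * phi * n, with n the gradient of phi and tau its 90-degree
    rotation. Following chi converges to and circulates the path (textbook
    construction, gains assumed)."""
    phi = p[0] ** 2 + p[1] ** 2 - radius ** 2
    n = np.array([2.0 * p[0], 2.0 * p[1]])        # gradient of phi
    tau = np.array([-n[1], n[0]])                 # tangential (path-following) term
    return tau - k * phi * n

# Integrate a unit-speed single-integrator robot along the field.
p = np.array([2.0, 0.0])
for _ in range(3000):
    v = gvf_circle(p)
    p = p + 0.001 * v / (np.linalg.norm(v) + 1e-9)
print(np.linalg.norm(p))   # ~1.0: the robot has converged onto the circle
```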
|
|
15:25-15:30, Paper WeDT5.3 | |
Dragonfly Drone: A Novel Tilt-Rotor Aerial Platform with Body-Morphing Capability |
|
Hameed, Syed Waqar | NTU |
Liew Jun Jie, Alex | Nanyang Technological University |
Nursultan, Imanberdiyev | Agency for Science, Technology and Research (A*STAR) |
Camci, Efe | Institute for Infocomm Research |
Yau, Wei-Yun | I2R |
Feroskhan, Mir | Nanyang Technological University |
Keywords: Aerial Systems: Mechanics and Control, Aerial Systems: Applications, Grasping
Abstract: The development of unmanned aerial vehicles (UAVs) with extended maneuverability has unlocked new applications such as complex inspection tasks at height. In this work, we introduce the Dragonfly drone, a novel tilt-rotor body-morphing UAV, capable of altering its shape and orientation without compromising its position tracking. Unlike most existing UAV designs that only target decoupling position and orientation control, Dragonfly can also perform unique body-morphing in flight, featuring all six degrees of freedom in every morphology. This enables navigation into tight gaps with irregular shapes, conforming to obstacles of varying geometries, and maintaining physical contact with uneven surfaces. Such capabilities make our design particularly effective for complex inspection tasks at height, such as pipe or bridge inspection. Our contributions include the mechanical design of the system, the modeling and control strategies employed, and the real-robot experiments with a prototype platform. See the Dragonfly drone in action: https://youtu.be/YxoV_Qt_5XE.
|
|
15:30-15:35, Paper WeDT5.4 | |
An Omnidirectional Non-Tethered Aerial Prototype with Fixed Uni-Directional Thrusters |
|
Hamandi, Mahmoud | New York University Abu Dhabi |
Ali, Abdullah Mohamed | New York University Abu Dhabi |
Kyriakopoulos, Kostas | New York University - Abu Dhabi |
Tzes, Anthony | New York University Abu Dhabi |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Aerial Systems: Mechanics and Control, Product Design, Development and Prototyping, Aerial Systems: Applications
Abstract: This paper presents the world's first functional prototype of an omnidirectional multi-rotor aerial vehicle with fixed uni-directional thrusters and an on-board power source. An optimization algorithm computes the positions and orientations of the propellers in the body frame of the prototype to achieve the omnidirectional capability, while minimizing the platform's weight and the required thrust to hover at any orientation, in addition to other construction requirements. The effect of the aerodynamic interaction between the different propellers is identified experimentally, and the ensuing results are included in the optimization algorithm to avoid such interactions during flight. The prototype's performance is assessed in real experiments demonstrating the decoupling between the forces and moments of the drone, its ability to concurrently track independent positions and orientations, and its ability to hover at a fixed position while rotating.
|
|
15:35-15:40, Paper WeDT5.5 | |
TrofyBot: A Transformable Rolling and Flying Robot with High Energy Efficiency |
|
Lai, Mingwei | Zhejiang University |
Ye, Yuqian | Stanford University |
Wu, Hanyu | ETH Zurich |
Xuan, Chice | Huzhou Institute of Zhejiang University, Huzhou |
Zhang, Ruibin | Zhejiang University |
Ren, Qiuyu | Zhejiang University |
Xu, Chao | Zhejiang University |
Gao, Fei | Zhejiang University |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Keywords: Aerial Systems: Mechanics and Control, Dynamics, Motion Control
Abstract: Terrestrial and aerial bimodal vehicles have gained significant interest due to their energy efficiency and versatile maneuverability across different domains. However, most existing passive-wheeled bimodal vehicles rely on attitude regulation to generate forward thrust, which inevitably wastes energy on producing lifting force. In this work, we propose a novel passive-wheeled bimodal vehicle called TrofyBot that can rapidly change the thrust direction with a single servo motor and a transformable parallelogram linkage mechanism (TPLM). Cooperating with a bidirectional force generation module (BFGM) for motors to produce bidirectional thrust, the robot achieves flexible mobility as a differentially driven rover on the ground. This design achieves 95.37% energy-saving efficiency in terrestrial locomotion, allowing the robot to move continuously on the ground for more than two hours in the current setup. Furthermore, the design obviates the need for attitude regulation and therefore provides a stable sensor field of view (FoV). We model the bimodal dynamics for the system, analyze its differential flatness property, and design a controller based on hybrid model predictive control for trajectory tracking. A prototype is built and extensive experiments are conducted to verify the design and the proposed controller, which achieves high energy efficiency and seamless transition between modes.
|
|
15:40-15:45, Paper WeDT5.6 | |
Dense Fixed-Wing Swarming Using Receding-Horizon NMPC |
|
Madabushi, Varun | Georgia Institute of Technology |
Kopel, Yocheved | The Johns Hopkins University Applied Physics Laboratory |
Polevoy, Adam | Johns Hopkins University Applied Physics Lab |
Moore, Joseph | Johns Hopkins University |
Keywords: Aerial Systems: Mechanics and Control, Distributed Robot Systems, Swarm Robotics
Abstract: In this paper, we present an approach for controlling a team of agile fixed-wing aerial vehicles in close proximity to one another. Our approach relies on receding-horizon nonlinear model predictive control (NMPC) to plan maneuvers across an expanded flight envelope to enable inter-agent collision avoidance. To facilitate robust collision avoidance and characterize the likelihood of inter-agent collisions, we compute a statistical bound on the probability of the system leaving a tube around the planned nominal trajectory. Finally, we propose a metric for evaluating highly dynamic swarms and use this metric to evaluate our approach. We successfully demonstrated our approach through both simulation and hardware experiments, and to our knowledge, this is the first time close-quarters swarming has been achieved with physical aerobatic fixed-wing vehicles.
|
|
15:45-15:50, Paper WeDT5.7 | |
HPA-MPC: Hybrid Perception-Aware Nonlinear Model Predictive Control for Quadrotors with Suspended Loads |
|
Sarvaiya, Mrunal | Agile Robotics and Perception Lab, NYU |
Li, Guanrui | New York University |
Loianno, Giuseppe | New York University |
Keywords: Aerial Systems: Applications, Aerial Systems: Perception and Autonomy
Abstract: Quadrotors equipped with cable-suspended loads represent a versatile, low-cost, and energy efficient solution for aerial transportation, construction, and manipulation tasks. However, their real-world deployment is hindered by several challenges. The system is difficult to control because it is nonlinear, underactuated, involves hybrid dynamics due to slack-taut cable modes, and evolves on complex configuration spaces. Additionally, it is crucial to estimate the full state and the cable’s mode transitions in real-time using on-board sensors and computation. To address these challenges, we present a novel Hybrid Perception-Aware Nonlinear Model Predictive Control (HPA-MPC) control approach for quadrotors with suspended loads. Our method considers the complete hybrid system dynamics and includes a perception-aware cost to ensure the payload remains visible in the robot’s camera during navigation. Furthermore, the full state and hybrid dynamics’ transitions are estimated using onboard sensors. Experimental results demonstrate that our approach enables stable load tracking control, even during slack-taut transitions, and operates entirely onboard. The experiments also show that the perception-aware term effectively keeps the payload in the robot’s camera field of view when a human operator interacts with the load.
|
|
WeDT6 |
307 |
Perception for Grasping and Manipulation |
Regular Session |
Chair: Dudek, Gregory | McGill University |
Co-Chair: Zhi, Weiming | Carnegie Mellon University |
|
15:15-15:20, Paper WeDT6.1 | |
Unifying Representation and Calibration with 3D Foundation Models |
|
Zhi, Weiming | Carnegie Mellon University |
Tang, Haozhan | Carnegie Mellon University |
Zhang, Tianyi | Carnegie Mellon University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception
Abstract: Representing the environment is a central challenge in robotics, and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of 3D foundation models. These are large, pre-trained neural networks that can establish fast and accurate multi-view correspondences with very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images, captured by a manipulator-mounted camera, to simultaneously construct an environmental representation and calibrate the camera relative to the robot's end-effector, in the absence of specific calibration markers. The resulting 3D environment representation is aligned with the robot's coordinate frame and maintains physically accurate scales. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.
|
|
15:20-15:25, Paper WeDT6.2 | |
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models |
|
Dey, Sombit | InSAIT, Sofia University |
Zaech, Jan-Nico | Sofia University |
Nikolov, Nikolay | Imperial College London |
Van Gool, Luc | ETH Zurich |
Paudel, Danda Pani | ETH Zurich |
Keywords: Perception for Grasping and Manipulation, Deep Learning in Grasping and Manipulation, Deep Learning Methods
Abstract: Recent progress in large language models and access to large-scale robotic datasets has sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the availability of open Vision-Language-Action models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models, and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome the aforementioned issue of visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA – which requires the adaptation of the visual backbones during initial training – to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by a factor of 77% and 66% for grasping and lifting in visual OOD tasks. We will make our source code and OOD evaluation framework publicly available.
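A minimal sketch of the model-merging idea behind a gradual backbone reversal: linearly interpolate each backbone parameter between its original pre-trained value and its fine-tuned value, raising the pre-trained share over stages. The dictionary names and schedule below are illustrative assumptions, not the ReVLA recipe.

```python
import numpy as np

def merge_backbone(pretrained, finetuned, alpha):
    """Interpolate every backbone parameter; alpha is the share of the original
    pre-trained weights and is increased over stages in a gradual-reversal
    schedule (names and schedule are illustrative assumptions)."""
    return {name: alpha * pretrained[name] + (1.0 - alpha) * finetuned[name]
            for name in pretrained}

# Toy parameter dictionaries standing in for a DINO-v2-style visual backbone.
pre = {"blocks.0.weight": np.ones((4, 4)), "blocks.0.bias": np.zeros(4)}
fin = {"blocks.0.weight": 0.2 * np.ones((4, 4)), "blocks.0.bias": 0.5 * np.ones(4)}

for alpha in (0.25, 0.5, 1.0):   # gradually revert toward the pre-trained backbone
    merged = merge_backbone(pre, fin, alpha)
    print(alpha, merged["blocks.0.weight"][0, 0])
```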
|
|
15:25-15:30, Paper WeDT6.3 | |
MonoDiff9D: Monocular Category-Level 9D Object Pose Estimation Via Diffusion Model |
|
Liu, Jian | National Engineering Research Center of Robot Vision Perception |
Sun, Wei | Hunan University |
Yang, Hui | Hunan University |
Zheng, Jin | Central South University |
Geng, Zichen | The University of Western Australia |
Rahmani, Hossein | Lancaster University |
Mian, Ajmal | University of Western Australia |
Keywords: Perception for Grasping and Manipulation, Semantic Scene Understanding, Computer Vision for Automation
Abstract: Object pose estimation is a core means for robots to understand and interact with their environment. For this task, monocular category-level methods are attractive as they require only a single RGB camera. However, current methods rely on shape priors or CAD models of the intra-class known objects. We propose a diffusion-based monocular category-level 9D object pose generation method, MonoDiff9D. Our motivation is to leverage the probabilistic nature of diffusion models to alleviate the need for shape priors, CAD models, or depth sensors for intra-class unknown object pose estimation. We first estimate coarse depth via DINOv2 from the monocular image in a zero-shot manner and convert it into a point cloud. We then fuse the global features of the point cloud with the input image and use the fused features along with the encoded time step to condition MonoDiff9D. Finally, we design a transformer-based denoiser to recover the object pose from Gaussian noise. Extensive experiments on two popular benchmark datasets show that MonoDiff9D achieves state-of-the-art monocular category-level 9D object pose estimation accuracy without the need for shape priors or CAD models at any stage. Our code will be made public at https://github.com/CNJianLiu/MonoDiff9D.
|
|
15:30-15:35, Paper WeDT6.4 | |
A Full-Optical Pre-Touch Dual-Modal and Dual-Mechanism (PDM²) Sensor for Robotic Grasping |
|
Fang, Cheng | Texas A&M University |
Yan, Zhiyu | Texas A&M University |
Guo, Fengzhi | Texas A&M University |
Li, Shuangliang | Texas A&M University |
Song, Dezhen | Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) |
Zou, Jun | Texas A&M University |
Keywords: Perception for Grasping and Manipulation, Range Sensing, Grasping
Abstract: We report a new full-optical pre-touch dual-modal and dual-mechanism (PDM²) sensor based on an air-coupled fiber-tip surface micromachined optical ultrasound transducer (SMOUT). Compared with the ring-shaped piezoelectric acoustic receivers in previous PDM² sensors, the acoustic signal received by the new fiber-tip SMOUT is read out optically, which is naturally resistant to surrounding electromagnetic interference (EMI) and makes complex grounding and shielding unnecessary. In addition, the new fiber-tip SMOUT receiver has a much smaller size, which makes it possible to further miniaturize the sensor package into a more compact structure. For verification, a prototype of the full-optical PDM² sensor has been designed, fabricated, and characterized. The experimental results show that even with the much smaller acoustic receiver, the new sensor can still achieve ranging and material/structure sensing performance comparable to the previous ones. Therefore, the new full-optical PDM² sensor design promises to provide a practical and miniaturized solution for ranging and material/structure sensing to assist robotic grasping of unknown objects.
|
|
15:35-15:40, Paper WeDT6.5 | |
Learning Active Tactile Perception through Belief-Space Control |
|
Tremblay, Jean-François | McGill University |
Meger, David Paul | McGill University |
Hogan, Francois | Massachusetts Institute of Technology |
Dudek, Gregory | McGill University |
Keywords: Perception for Grasping and Manipulation, Model Learning for Control, Planning under Uncertainty
Abstract: Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.
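As a toy stand-in for the belief update at the heart of such a pipeline, the snippet below runs a discrete Bayes filter over a grid of mass hypotheses given noisy acceleration measurements from pushes with a known force; the observation model and numbers are assumptions for illustration, not the paper's differentiable filter.

```python
import numpy as np

def bayes_update(belief, mass_grid, observed_accel, force, noise_std=0.2):
    """One discrete Bayes-filter update of a belief over object mass after a
    push with known force and a noisy acceleration measurement (a = F / m).
    A toy stand-in for a learned, differentiable Bayesian filter."""
    predicted_accel = force / mass_grid
    likelihood = np.exp(-0.5 * ((observed_accel - predicted_accel) / noise_std) ** 2)
    posterior = belief * likelihood
    return posterior / posterior.sum()

mass_grid = np.linspace(0.2, 5.0, 200)             # mass hypotheses in kg
belief = np.ones_like(mass_grid) / len(mass_grid)  # uniform prior
for accel in (2.6, 2.4, 2.5):                      # three pushes with F = 5 N
    belief = bayes_update(belief, mass_grid, accel, force=5.0)
print(mass_grid[belief.argmax()])                  # ~2.0 kg, since 5 N / 2 kg = 2.5 m/s^2
```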
|
|
15:40-15:45, Paper WeDT6.6 | |
Detection of Fast-Moving Objects with Neuromorphic Hardware |
|
Ziegler, Andreas | University of Tübingen |
Vetter, Karl | University of Tübingen |
Gossard, Thomas | University of Tübingen |
Tebbe, Jonas | University of Tübingen |
Otte, Sebastian | University of Lübeck |
Zell, Andreas | University of Tübingen |
Keywords: Neurorobotics, Object Detection, Segmentation and Categorization, Machine Learning for Robot Control
Abstract: Neuromorphic Computing (NC) and Spiking Neural Networks (SNNs) in particular are often viewed as the next generation of neural networks. NC is a novel bio-inspired paradigm for energy-efficient neural computation, often relying on SNNs in which neurons communicate via spikes in a sparse, event-based manner. This communication via spikes can be exploited by neuromorphic hardware implementations very effectively and results in drastic reductions of energy consumption and latency in contrast to regular GPU-based neural networks. In recent years, neuromorphic hardware has become more accessible and the support of learning frameworks has improved. However, available hardware is still partly experimental, and it is not transparent what these solutions are effectively capable of, how they integrate into real-world robotics applications, and how they realistically benefit energy efficiency and latency. In this work, we provide the robotics research community with an overview of what is possible with SNNs on neuromorphic hardware focusing on real-time processing. We introduce a benchmark of three popular neuromorphic hardware devices for the task of event-based object detection. Moreover, we show that an SNN on neuromorphic hardware is able to run in real time in a closed-loop robotic system embedded within a challenging table tennis robot scenario.
|
|
15:45-15:50, Paper WeDT6.7 | |
Grasp, See, and Place: Efficient Unknown Object Rearrangement with Policy Structure Prior |
|
Xu, Kechun | Zhejiang University |
Zhou, Zhongxiang | Zhejiang University |
Wu, Jun | Zhejiang University |
Lu, Haojian | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Manipulation Planning, Deep Learning in Robotics and Automation, Grasping, Intelligent and Flexible Manufacturing
Abstract: We focus on the task of unknown object rearrangement, where a robot is supposed to re-configure the objects into a desired goal configuration specified by an RGB-D image. Recent works explore unknown object rearrangement systems by incorporating learning-based perception modules. However, they are sensitive to perception error, and pay less attention to task-level performance. In this paper, we aim to develop an effective system for unknown object rearrangement amidst perception noise. We theoretically reveal that noisy perception impacts grasping and placing in a decoupled way, and show that such a decoupled structure is valuable for improving task optimality. We propose GSP, a dual-loop system with the decoupled structure as a prior. For the inner loop, we learn a see policy for self-confident in-hand object matching. For the outer loop, we learn a grasp policy aware of object matching and grasp capability guided by task-level rewards. We leverage the foundation model CLIP for object matching, policy learning and self-termination. A series of experiments indicate that GSP can conduct unknown object rearrangement with higher completion rates and fewer steps.
|
|
WeDT7 |
309 |
Perception 2 |
Regular Session |
Chair: Li, Xiaopeng | University of Wisconsin-Madison |
|
15:15-15:20, Paper WeDT7.1 | |
LightStereo: Channel Boost Is All You Need for Efficient 2D Cost Aggregation |
|
Guo, Xianda | School of Computer Science, Wuhan University |
Zhang, Chenming | Waytous |
Zhang, Youmin | University of Bologna |
Zheng, Wenzhao | Tsinghua University |
Nie, Dujun | Huazhong University of Science and Technology |
Poggi, Matteo | University of Bologna |
Chen, Long | Chinese Academy of Sciences |
Keywords: Computer Vision for Transportation
Abstract: We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration yields a range of strategies to amplify the capacity of this pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric on the SceneFlow datasets while requiring only 22 GFLOPs and 17 ms of runtime, and ranks 1st on KITTI 2015 among real-time models. Our comprehensive analysis reveals the effect of 2D cost aggregation for stereo matching, paving the way for real-world applications of efficient stereo systems. Code is available at https://github.com/XiandaGuo/OpenStereo.
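To illustrate the kind of lightweight volume a 2D aggregation network operates on, the sketch below builds a correlation-based 3D cost volume of shape (disparity, height, width), so that disparity can be treated as the channel dimension; it is an illustrative re-implementation under assumed shapes, not the authors' code.

```python
import numpy as np

def correlation_cost_volume(feat_l, feat_r, max_disp):
    """Correlation-based 3D cost volume of shape (D, H, W): one matching score
    per disparity, letting a 2D network aggregate over the disparity channel
    (illustrative sketch, not the LightStereo implementation).

    feat_l, feat_r : (C, H, W) left/right feature maps
    """
    _, H, W = feat_l.shape
    volume = np.zeros((max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        if d == 0:
            volume[d] = (feat_l * feat_r).mean(axis=0)
        else:
            # correlate left pixels at column x with right pixels at column x - d
            volume[d, :, d:] = (feat_l[:, :, d:] * feat_r[:, :, :-d]).mean(axis=0)
    return volume

feat_l = np.random.rand(8, 32, 64).astype(np.float32)
feat_r = np.random.rand(8, 32, 64).astype(np.float32)
print(correlation_cost_volume(feat_l, feat_r, max_disp=24).shape)   # (24, 32, 64)
```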
|
|
15:20-15:25, Paper WeDT7.2 | |
SurfaceAug: Toward Versatile, Multimodally Consistent Ground Truth Sampling |
|
Rubel, Ryan | University of Southern California |
Clark, Nathan | Noblis, Inc |
Dudash, Andrew | Noblis |
Keywords: Computer Vision for Transportation, Object Detection, Segmentation and Categorization, AI-Enabled Robotics
Abstract: Despite recent advances in both model architectures and data augmentation, multimodal object detectors still barely outperform their LiDAR-only counterparts. This shortcoming has been attributed to a lack of sufficiently powerful multimodal data augmentation. To address this, we present SurfaceAug, a novel ground truth sampling algorithm. SurfaceAug pastes objects by resampling both images and point clouds, enabling object-level transformations in both modalities. We evaluate our algorithm by training a multimodal detector on KITTI and compare its performance to previous works. We show experimentally that SurfaceAug demonstrates promising improvements on car detection tasks.
|
|
15:25-15:30, Paper WeDT7.3 | |
Uncertainty-Guided Enhancement on Driving Perception System Via Foundation Models |
|
Yang, Yunhao | University of Texas at Austin |
Hu, Yuxin | Cruise |
Ye, Mao | The University of Texas, Austin |
Zhang, Zaiwei | Cruise |
Lu, Zhichao | Cruise LLC |
Xu, Yi | Northeastern University |
Topcu, Ufuk | The University of Texas at Austin |
Snyder, Ben | Cruise |
Keywords: Computer Vision for Transportation, Calibration and Identification, Probability and Statistical Methods
Abstract: Multimodal foundation models offer promising advancements for enhancing driving perception systems, but their high computational and financial costs pose challenges. We develop a method that leverages foundation models to refine predictions from existing driving perception models---such as enhancing object classification accuracy---while minimizing the frequency of using these resource-intensive models. The method quantitatively characterizes uncertainties in the perception model's predictions and engages the foundation model only when these uncertainties exceed a pre-specified threshold. Specifically, it characterizes uncertainty by calibrating the perception model’s confidence scores into theoretical lower bounds on the probability of correct predictions using conformal prediction. Then, it sends images to the foundation model and queries for refining the predictions only if the theoretical bound of the perception model's outcome is below the threshold. Additionally, we propose a temporal inference mechanism that enhances prediction accuracy by integrating historical predictions, leading to tighter theoretical bounds. The method demonstrates a 10 to 15 percent improvement in prediction accuracy and reduces the number of queries to the foundation model by 50 percent, based on quantitative evaluations from driving datasets.
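A minimal sketch of the gating idea: calibrate a threshold from the perception model's confidence on held-out data and query the foundation model only when a test prediction does not clear it. The split-conformal recipe below is in the spirit of the paper's uncertainty characterization, not its exact formulation, and all names and numbers are assumptions.

```python
import numpy as np

def conformal_threshold(cal_conf_true, alpha=0.1):
    """Split-conformal threshold from calibration-set softmax scores of the
    *true* class. Predictions whose top score clears 1 - qhat retain roughly
    (1 - alpha) coverage; everything else gets escalated (illustrative sketch)."""
    scores = np.sort(1.0 - np.asarray(cal_conf_true))      # nonconformity scores
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1.0 - alpha))) - 1        # conformal quantile index
    return scores[min(rank, n - 1)]

def should_query_foundation_model(top_score, qhat):
    # Escalate to the expensive foundation model only when the cheap perception
    # model's confidence does not clear the calibrated bar.
    return top_score < 1.0 - qhat

rng = np.random.default_rng(0)
cal = rng.beta(8, 2, size=500)                              # toy calibration confidences
qhat = conformal_threshold(cal, alpha=0.1)
print(should_query_foundation_model(0.45, qhat))            # low confidence -> query
print(should_query_foundation_model(0.97, qhat))            # high confidence -> keep prediction
```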
|
|
15:30-15:35, Paper WeDT7.4 | |
Complementary Information Guided Occupancy Prediction Via Multi-Level Representation Fusion |
|
Xu, Rongtao | Institute of Automation, Chinese Academy of Sciences, Beijing, C |
Lin, Jinzhou | Beijing University of Posts and Communications |
Zhou, Jialei | Tongji University |
Dong, Jiahua | Shenyang Institute of Automation Chinese Academy of Sciences |
Wang, Changwei | Casia |
Wang, Ruisheng | University of Calgary |
Guo, Li | BUPT |
Xu, Shibiao | Beijing University of Posts and Telecommunications |
Liang, Xiaodan | Sun Yat-Sen University |
Keywords: Computer Vision for Transportation, Computer Vision for Automation
Abstract: Camera-based occupancy prediction is a mainstream approach for 3D perception in autonomous driving, aiming to infer complete 3D scene geometry and semantics from 2D images. Almost all existing methods focus on improving performance through structural modifications, such as lightweight backbones and complex cascaded frameworks, achieving good yet limited performance. Few studies explore the problem from the perspective of representation fusion, leaving the rich diversity of features in 2D images underutilized. Motivated by this, we propose CIGOcc, a two-stage occupancy prediction framework based on multi-level representation fusion. CIGOcc extracts segmentation, graphics, and depth features from an input image and introduces a deformable multi-level fusion mechanism to fuse these three multi-level features. Additionally, CIGOcc incorporates knowledge distilled from SAM to further enhance prediction accuracy. Without increasing training costs, CIGOcc achieves state-of-the-art performance on the SemanticKITTI benchmark. The code is provided in the supplementary material and will be released at https://github.com/VitaLemonTea1/CIGOcc.
|
|
15:35-15:40, Paper WeDT7.5 | |
Domain Adaptation-Based Crossmodal Knowledge Distillation for 3D Semantic Segmentation |
|
Kang, Jialiang | Peking University |
Wang, Jiawen | Peking University |
Luo, Dingsheng | Peking University |
Keywords: Computer Vision for Transportation, Sensor Fusion, Visual Learning
Abstract: Semantic segmentation of 3D LiDAR data plays a pivotal role in autonomous driving. Traditional approaches rely on extensive annotated data for point cloud analysis, incurring high costs and time investments. In contrast, real-world image datasets offer abundant availability and substantial scale. To mitigate the burden of annotating 3D LiDAR point clouds, we propose two crossmodal knowledge distillation methods: Unsupervised Domain Adaptation Knowledge Distillation (UDAKD) and Feature and Semantic-based Knowledge Distillation (FSKD). Leveraging readily available spatio-temporally synchronized data from cameras and LiDARs in autonomous driving scenarios, we directly apply a pretrained 2D image model to unlabeled 2D data. Through crossmodal knowledge distillation with known 2D-3D correspondence, we actively align the output of the 3D network with the corresponding points of the 2D network, thereby obviating the necessity for 3D annotations. Our focus is on preserving modality-general information while filtering out modality-specific details during crossmodal distillation. To achieve this, we deploy self-calibrated convolution on 3D point clouds as the foundation of our domain adaptation module. Rigorous experimentation validates the effectiveness of our proposed methods, consistently surpassing the performance of state-of-the-art approaches in the field. Code will be released upon publication.
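The core distillation step can be pictured as pulling 3D point features toward the frozen 2D features they project onto. The snippet below computes such an alignment loss for known 2D-3D correspondences; it is illustrative of the general objective, not the exact UDAKD/FSKD losses, and all shapes are assumptions.

```python
import numpy as np

def crossmodal_distill_loss(feat_3d, feat_2d, correspondences):
    """Mean squared alignment loss between 3D point features (student) and the
    2D pixel features they project onto (frozen teacher). `correspondences`
    maps each point index to a (row, col) pixel (illustrative sketch).

    feat_3d : (N, C) student features for N LiDAR points
    feat_2d : (H, W, C) frozen teacher feature map
    """
    rows, cols = correspondences[:, 0], correspondences[:, 1]
    teacher = feat_2d[rows, cols]                  # (N, C) matched pixel features
    return np.mean((feat_3d - teacher) ** 2)

rng = np.random.default_rng(0)
feat_2d = rng.normal(size=(48, 64, 16))
corr = np.stack([rng.integers(0, 48, 100), rng.integers(0, 64, 100)], axis=1)
feat_3d = feat_2d[corr[:, 0], corr[:, 1]] + 0.1 * rng.normal(size=(100, 16))
print(crossmodal_distill_loss(feat_3d, feat_2d, corr))   # ~0.01 for this toy data
```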
|
|
15:40-15:45, Paper WeDT7.6 | |
Nonlinear Motion-Guided and Spatio-Temporal Aware Network for Unsupervised Event-Based Optical Flow |
|
Liu, Zuntao | Northeastern University of China |
Zhuang, Hao | Northeastern University |
Jiang, Junjie | Northeastern University |
Song, Yuhang | Northeastern University - China |
Fang, Zheng | Northeastern University |
Keywords: Computer Vision for Transportation, Visual Learning, Deep Learning for Visual Perception
Abstract: Event cameras have the potential to capture continuous motion information over time and space, making them well-suited for optical flow estimation. However, most existing learning-based methods for event-based optical flow adopt frame-based techniques, ignoring the spatio-temporal characteristics of events. Additionally, these methods assume linear motion between consecutive events within the loss time window, which increases optical flow errors in long-time sequences. In this work, we observe that rich spatio-temporal information and accurate nonlinear motion between events are crucial for event-based optical flow estimation. Therefore, we propose E-NMSTFlow, a novel unsupervised event-based optical flow network focusing on long-time sequences. We propose a Spatio-Temporal Motion Feature Aware (STMFA) module and an Adaptive Motion Feature Enhancement (AMFE) module, both of which utilize rich spatio-temporal information to learn spatio-temporal data associations. Meanwhile, we propose a nonlinear motion compensation loss that utilizes the accurate nonlinear motion between events to improve the unsupervised learning of our network. Extensive experiments demonstrate the effectiveness and superiority of our method. Remarkably, our method ranks first among unsupervised learning methods on the MVSEC and DSEC-Flow datasets.
|
|
15:45-15:50, Paper WeDT7.7 | |
V2X-DG: Domain Generalization for Vehicle-To-Everything Cooperative Perception |
|
Li, Baolu | Cleveland State University |
Xu, Zongzhe | Carnegie Mellon University |
Li, Jinlong | Cleveland State University |
Liu, Xinyu | Cleveland State University |
Fang, Jianwu | Xian Jiaotong University |
Li, Xiaopeng | University of Wisconsin-Madison |
Yu, Hongkai | Cleveland State University |
Keywords: Computer Vision for Transportation, Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem for LiDAR-based V2X cooperative perception (V2X-DG) based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on the source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in the source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.
|
|
WeDT8 |
311 |
Representation Learning 3 |
Regular Session |
|
15:15-15:20, Paper WeDT8.1 | |
IMOST: Incremental Memory Mechanism with Online Self-Supervision for Continual Traversability Learning |
|
Ma, Kehui | Shanghai Jiao Tong University |
Sun, Zhen | Shanghai Jiao Tong University |
Xiong, Chaoran | Shanghai Jiao Tong University |
Zhu, Qiumin | Shanghai Jiao Tong University |
Wang, Kewei | Shanghai Jiao Tong University |
Pei, Ling | Shanghai Jiao Tong University |
Keywords: Visual Learning, Incremental Learning, Learning from Demonstration
Abstract: Traversability estimation is the foundation of path planning for a general navigation system. However, complex and dynamic environments pose challenges for the latest methods using self-supervised learning (SSL) techniques. Firstly, existing SSL-based methods generate sparse annotations lacking detailed boundary information. Secondly, their strategies focus on hard samples for rapid adaptation, leading to forgetting and biased predictions. In this work, we propose IMOST, a continual traversability learning framework composed of two key modules: incremental dynamic memory (IDM) and self-supervised annotation (SSA). By mimicking human memory mechanisms, IDM allocates novel data samples to new clusters according to an information expansion criterion. It also updates clusters based on a diversity rule, ensuring a representative characterization of the new scene. This mechanism enhances scene-aware knowledge diversity while maintaining a compact memory capacity. The SSA module, integrating FastSAM, utilizes point prompts to generate complete annotations in real time, which reduces training complexity. Furthermore, IMOST has been successfully deployed on a quadruped robot, with performance evaluated during the online learning process. Experimental results on both public and self-collected datasets demonstrate that our IMOST outperforms the current state-of-the-art method, maintaining robust recognition capabilities and adaptability across various scenarios. The code is available at https://github.com/SJTU-MKH/OCLTrav.
|
|
15:20-15:25, Paper WeDT8.2 | |
SparseDrive: End-To-End Autonomous Driving Via Sparse Scene Representation |
|
Sun, Wenchao | Tsinghua University |
Lin, Xuewu | Horizon |
Shi, Yining | Tsinghua University |
Zhang, Chuang | Tsinghua University |
Wu, Haoran | Tsinghua University |
Zheng, Sifa | Tsinghua University |
Keywords: Imitation Learning, Computer Vision for Transportation
Abstract: The well-established modular autonomous driving system is decoupled into different standalone tasks, e.g. perception, prediction and planning, suffering from information loss and error accumulation across modules. In contrast, end-to-end paradigms unify multi-tasks into a fully differentiable framework, allowing for optimization in a planning-oriented spirit. Despite the great potential of end-to-end paradigms, both the performance and efficiency of existing methods are not satisfactory, particularly in terms of planning safety. We attribute this to the computationally expensive BEV (bird's eye view) features and the straightforward design for prediction and planning. To this end, we explore the sparse representation and review the task design for end-to-end autonomous driving, proposing a new paradigm named SparseDrive. Concretely, SparseDrive consists of a symmetric sparse perception module and a parallel motion planner. The sparse perception module unifies detection, tracking and online mapping with a symmetric model architecture, learning a fully sparse representation of the driving scene. For motion prediction and planning, we review the great similarity between these two tasks, leading to a parallel design for the motion planner. Based on this parallel design, which models planning as a multi-modal problem, we propose a hierarchical planning selection strategy, which incorporates a collision-aware rescore module, to select a rational and safe trajectory as the final planning output. With such effective designs, SparseDrive surpasses previous state-of-the-art methods by a large margin in the performance of all tasks, while achieving much higher training and inference efficiency.
|
|
15:25-15:30, Paper WeDT8.3 | |
MOTION TRACKS: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning |
|
Ren, Juntao | Cornell University |
Sundaresan, Priya | Stanford University |
Sadigh, Dorsa | Stanford University |
Choudhury, Sanjiban | Cornell University |
Bohg, Jeannette | Stanford University |
Keywords: Imitation Learning, Learning from Demonstration, Transfer Learning
Abstract: Teaching robots to autonomously complete everyday tasks remains a challenge. Imitation Learning (IL) is a powerful approach that imbues robots with skills via demonstrations, but is limited by the labor-intensive process of collecting teleoperated robot data. Human videos offer a scalable alternative, but it remains difficult to directly train IL policies from them due to the lack of robot action labels. To address this, we propose to represent actions as short-horizon 2D trajectories on an image. These actions, or motion tracks, capture the predicted direction of motion for both human hands and robot end-effectors. We instantiate an IL policy called Motion Track Policy (MT-π) which receives image observations and outputs motion tracks as actions. By leveraging this unified, cross-embodiment action space, MT-π completes tasks with high success given just minutes of human video and limited additional robot demonstrations. At test time, we predict motion tracks from two camera views, recovering 6DoF trajectories via multi-view synthesis. MT-π achieves an average success rate of 86.5% across 4 real-world tasks, outperforming state-of-the-art IL baselines which do not leverage human data or our action space by 40%, and generalizes to scenarios seen only in human videos. Code and videos are available on our website (https://portal-cornell.github.io/motion_track_policy/).
|
|
15:30-15:35, Paper WeDT8.4 | |
Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation |
|
Wu, Kun | Syracuse University |
Zhu, Yichen | Midea Group |
Li, Jinming | Shanghai University |
Wen, Junjie | East China Normal University |
Liu, Ning | Beijing Innovation Center of Humanoid Robotics |
Xu, Zhiyuan | Midea Group |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Learning visuomotor policy for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instruction. We evaluate our method on both simulation and multiple real-world embodiments, including both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms a well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.
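At the center of such a method is nearest-neighbour vector quantization of continuous action embeddings against a learned codebook. The sketch below shows that single operation with assumed shapes; the training losses, encoder, and decoder are omitted, and the code is illustrative rather than the paper's implementation.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Nearest-neighbour vector quantization of action-chunk embeddings, the
    basic operation behind learning a discrete latent action space.

    latents  : (B, D) continuous action-sequence embeddings
    codebook : (K, D) learned code vectors
    returns  : code indices (B,) and quantized vectors (B, D)
    """
    # Squared Euclidean distance between every latent and every code.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))      # 16 task-specific codes
latents = rng.normal(size=(4, 8))        # embeddings of 4 action sequences
idx, quantized = vector_quantize(latents, codebook)
print(idx, quantized.shape)              # 4 code indices and a (4, 8) quantized batch
```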
|
|
15:35-15:40, Paper WeDT8.5 | |
AnyCar to Anywhere: Learning Universal Dynamics Model for Agile and Adaptive Mobility |
|
Xiao, Wenli | Carnegie Mellon University |
Xue, Haoru | University of California Berkeley |
Tao, Tony | Carnegie Mellon University |
Kalaria, Dvij | Carnegie Mellon University |
Dolan, John M. | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Representation Learning, Machine Learning for Robot Control, Data Sets for Robot Learning
Abstract: Recent works in the robot learning community have successfully introduced generalist models capable of controlling various robot embodiments across a wide range of tasks, such as navigation and locomotion. However, achieving agile control, which pushes the limits of robotic performance, still relies on specialist models that require extensive parameter tuning. To leverage generalist-model adaptability and flexibility while achieving specialist-level agility, we propose AnyCar, a transformer-based generalist dynamics model designed for agile control of various wheeled robots. To collect training data, we unify multiple simulators and leverage different physics backends to simulate vehicles with diverse sizes, scales, and physical properties across various terrains. With robust training and real-world fine-tuning, our model enables precise adaptation to different vehicles, even in the wild and under large state estimation errors. In real-world experiments, AnyCar shows both few-shot and zero-shot generalization across a wide range of vehicles and environments, where our model, combined with a sampling-based MPC, outperforms specialist models by up to 54%. These results represent a key step toward building a foundation model for agile wheeled robot control. AnyCar is fully open-source to support further research.
|
|
15:40-15:45, Paper WeDT8.6 | |
UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation |
|
Tang, Yihe | Stanford University |
Huang, Wenlong | Stanford University |
Wang, Yingke | Stanford University |
Li, Chengshu | Stanford University |
Yuan, Roy | Stanford University |
Zhang, Ruohan | Stanford University |
Wu, Jiajun | Stanford University |
Fei-Fei, Li | Stanford University |
Keywords: Representation Learning, Deep Learning for Visual Perception, Sensorimotor Learning
Abstract: Understanding fine-grained object affordances is imperative for robots to manipulate objects in unstructured environments given open-ended task instructions. However, existing methods for visual affordance prediction often rely on manually annotated data or are conditioned only on a predefined set of tasks. We introduce Unsupervised Affordance Distillation (UAD), a method for distilling affordance knowledge from foundation models into a task-conditioned affordance model without any manual annotations. By leveraging the complementary strengths of large vision models and vision-language models, UAD automatically annotates a large-scale dataset with detailed pairs. Training only a lightweight task-conditioned decoder atop frozen features, UAD exhibits notable generalization to in-the-wild robotic scenes as well as to various human activities despite only being trained on rendered objects in simulation. Using affordances provided by UAD as the observation space, we show an imitation learning policy that demonstrates promising generalization to unseen object instances, object categories, and even variations in task instructions after training on as few as 10 demonstrations.
|
|
15:45-15:50, Paper WeDT8.7 | |
Learning Dynamics of a Ball with Differentiable Factor Graph and Roto-Translational Invariant Representations |
|
Xiao, Qingyu | Georgia Institute of Technology |
Wu, Zixuan | Georgia Institute of Technology |
Gombolay, Matthew | Georgia Institute of Technology |
Keywords: Representation Learning, Probabilistic Inference, SLAM
Abstract: Robots in dynamic environments need fast, accurate models of how objects move in their environments to support agile planning. In sports such as ping pong, analytical models often struggle to accurately predict ball trajectories with spins due to complex aerodynamics, elastic behaviors, and the challenges of modeling sliding and rolling friction. On the other hand, despite the promise of data-driven methods, machine learning struggles to make accurate, consistent predictions without precise input. In this paper, we propose an end-to-end learning framework that can jointly train a dynamics model and a factor graph estimator. Our approach leverages a Gram-Schmidt (GS) process to extract roto-translational invariant representations to improve the model performance, which further reduces the validation error compared to a data augmentation method. Additionally, we propose a network architecture that enhances nonlinearity by using self-multiplicative bypasses in the layer connections. By leveraging these novel methods, our proposed approach predicts the ball's position with an RMSE of 37.2 mm of the paddle radius at the apex after the first bounce, and 71.5 mm after the second bounce.
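One plausible way to build a roto-translational invariant trajectory representation with a Gram-Schmidt-style frame is sketched below: translate by the first point, align the first axis with the initial horizontal velocity, and keep gravity as the second axis, making the encoding invariant to horizontal translations and rotations about gravity. This is an assumed construction for illustration, not necessarily the paper's.

```python
import numpy as np

def invariant_frame(points):
    """Express a ball trajectory in a frame built from its own motion: origin at
    the first point, first axis along the initial horizontal velocity, second
    axis vertical, third completing the right-handed frame.

    points : (T, 3) positions with z as the vertical axis
    """
    rel = points - points[0]
    v0 = points[1] - points[0]
    e1 = np.array([v0[0], v0[1], 0.0])
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.array([0.0, 0.0, 1.0])          # gravity axis, already orthogonal to e1
    e3 = np.cross(e1, e2)
    R = np.stack([e1, e2, e3], axis=1)      # columns form the new basis
    return rel @ R

# Two copies of the same throw, rotated about gravity and shifted, map to the
# same representation.
t = np.linspace(0.0, 0.5, 6)[:, None]
traj = np.hstack([3.0 * t, 1.0 * t, 2.0 * t - 4.9 * t ** 2])
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
traj2 = traj @ Rz.T + np.array([1.0, -2.0, 0.3])
print(np.allclose(invariant_frame(traj), invariant_frame(traj2)))   # True
```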
|
|
WeDT9 |
312 |
Motion Planning 5 |
Regular Session |
Chair: Xiao, Xuesu | George Mason University |
|
15:15-15:20, Paper WeDT9.1 | |
Semi-Supervised Active Learning for Semantic Segmentation in Unknown Environments Using Informative Path Planning |
|
Rückin, Julius | University of Bonn |
Magistri, Federico | University of Bonn |
Stachniss, Cyrill | University of Bonn |
Popovic, Marija | TU Delft |
Keywords: Motion and Path Planning, Deep Learning for Visual Perception, Semantic Scene Understanding
Abstract: Semantic segmentation enables robots to perceive and reason about their environments beyond geometry. Most such systems build upon deep learning approaches. As autonomous robots are commonly deployed in initially unknown environments, pre-training on static datasets cannot always capture the variety of domains and limits the robot’s perception performance during missions. Recently, self-supervised and fully supervised active learning methods emerged to improve robotic vision. These approaches rely on large in-domain pre-training datasets or require substantial human labelling effort. We propose a planning method for semi-supervised active learning of semantic segmentation that substantially reduces human labelling requirements compared to fully supervised approaches. We leverage an adaptive map-based planner guided towards the frontiers of unexplored space with high model uncertainty, collecting training data for human labelling. A key aspect of our approach is to combine the sparse high-quality human labels with pseudo labels automatically extracted from highly certain environment map areas. Experimental results show that our method reaches segmentation performance close to fully supervised approaches with drastically reduced human labelling effort while outperforming self-supervised approaches.
|
|
15:20-15:25, Paper WeDT9.2 | |
FutureNet-LOF: Joint Trajectory Prediction and Lane Occupancy Field Prediction with Future Context Encoding |
|
Wang, Mingkun | Peking University |
Ren, Xiaoguang | Academy of Military Sciences |
Jin, Ruochun | National University of Defense Technology |
Li, Minglong | National University of Defense Technology |
Zhang, Xiaochuan | Academy of Military Science |
Yu, Changqian | Meituan |
Wang, Mingxu | Fudan University |
Yang, Wenjing | State Key Laboratory of High Performance Computing (HPCL), Schoo |
Keywords: Motion and Path Planning, Computer Vision for Transportation, Computer Vision for Automation
Abstract: Most prior motion prediction endeavors in autonomous driving have inadequately encoded future scenarios, leading to predictions that may fail to accurately capture the diverse movements of agents (e.g., vehicles or pedestrians). To address this, we propose FutureNet, which explicitly integrates initially predicted trajectories into the future scenario and further encodes these future contexts to enhance subsequent forecasting. Additionally, most previous motion forecasting works have focused on predicting independent futures for each agent. However, safe and smooth autonomous driving requires accurately predicting the diverse future behaviors of numerous surrounding agents jointly in complex dynamic environments. Given that all agents occupy certain potential travel spaces and possess lane driving priority, we propose Lane Occupancy Field (LOF), a new representation with lane semantics for motion forecasting in autonomous driving. LOF can simultaneously capture the joint probability distribution of all road participants' future spatial-temporal positions. Due to the high compatibility between lane occupancy field prediction and trajectory prediction, we propose a novel network for joint prediction of these two tasks. Our approach ranks 1st on two large-scale motion forecasting benchmarks: Argoverse 1 and Argoverse 2, while it is also the champion method of the CVPR 2024 Argoverse 2 motion forecasting challenge.
|
|
15:25-15:30, Paper WeDT9.3 | |
Hierarchical Reinforcement Learning for Safe Mapless Navigation with Congestion Estimation |
|
Gao, Jianqi | Harbin Institute of Technology (Shenzhen) |
Pang, Xizheng | Harbin Institute of Technology, Shenzhen |
Liu, Qi | Northeastern University |
Li, Yanjie | Harbin Institute of Technology (Shenzhen) |
Keywords: Motion and Path Planning, Collision Avoidance, Reinforcement Learning
Abstract: Reinforcement learning-based mapless navigation holds significant potential. However, it faces challenges in indoor environments with local minima area. This paper introduces a safe mapless navigation framework utilizing hierarchical reinforcement learning (HRL) to enhance navigation through such areas. The high-level policy creates a sub-goal to direct the navigation process. Notably, we have developed a sub-goal update mechanism that considers environment congestion, efficiently avoiding the entrapment of the robot in local minimum areas. The low-level motion planning policy, trained through safe reinforcement learning, outputs real-time control instructions based on acquired sub-goal. Specifically, to enhance the robot's environmental perception, we introduce a new obstacle encoding method that evaluates the impact of obstacles on the robot's motion planning. To validate the performance of our HRL-based navigation framework, we conduct simulations in office, home, and restaurant environments. The findings demonstrate that our HRL-based navigation framework excels in both static and dynamic scenarios. Finally, we implement the HRL-based navigation framework on a TurtleBot3 robot for physical validation experiments, which exhibits its strong generalization capabilities.
|
|
15:30-15:35, Paper WeDT9.4 | |
Hierarchical End-To-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning |
|
Lu, Siyi | Central South University |
He, Lei | Tsinghua University |
Li, Shengbo Eben | Tsinghua University |
Luo, Yugong | Tsinghua University |
Wang, Jianqiang | Tsinghua University |
Li, Keqiang | Tsinghua University |
Keywords: Integrated Planning and Control, Reinforcement Learning, Vision-Based Navigation
Abstract: End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird’s-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%. The code of our approach is publicly available at https://github.com/CBDES-e2e/PEDe2e-driving
|
|
15:35-15:40, Paper WeDT9.5 | |
Multi-Goal Motion Memory |
|
Lu, Yuanjie | George Mason University |
Das, Dibyendu | George Mason University |
Plaku, Erion | U.S. National Science Foundation |
Xiao, Xuesu | George Mason University |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Deep Learning Methods
Abstract: Autonomous mobile robots (e.g., warehouse logistics robots) often need to traverse complex, obstacle-rich, and changing environments to reach multiple fixed goals (e.g., warehouse shelves). Traditional motion planners need to calculate the entire multi-goal path from scratch in response to changes in the environment, which consumes substantial computing resources. This process is not only time-consuming but may also fail to meet real-time requirements in application scenarios that require rapid response to environmental changes. In this paper, we provide a novel Multi-Goal Motion Memory technique that allows robots to use previous planning experiences to accelerate future multi-goal planning in changing environments. Specifically, our technique predicts collision-free and dynamically-feasible trajectories and distances between goal pairs to guide the sampling process to build a roadmap, to inform a Traveling Salesman Problem (TSP) solver to compute a tour, and to efficiently produce motion plans. Experiments conducted with a vehicle and a snake-like robot in obstacle-rich environments show that the proposed Motion Memory technique can substantially accelerate planning speed by up to 90%. Furthermore, the solution quality is comparable to state-of-the-art algorithms and even better in some environments.
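A minimal sketch of how predicted goal-to-goal distances could inform a tour computation, using a greedy nearest-neighbour heuristic as a stand-in for the TSP solver mentioned above; the distance matrix here is random and purely illustrative.

import numpy as np

def tour_from_predicted_distances(dist, start=0):
    # Greedy nearest-neighbour tour over goals given a (learned) pairwise
    # distance matrix; a real system would feed such predictions to a proper TSP solver.
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: dist[last][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Example with a random symmetric "predicted" distance matrix over 5 goals.
rng = np.random.default_rng(0)
d = rng.random((5, 5))
d = (d + d.T) / 2
np.fill_diagonal(d, 0)
print(tour_from_predicted_distances(d))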
|
|
15:40-15:45, Paper WeDT9.6 | |
Dual-BEV Nav: Dual-Layer BEV-Based Heuristic Path Planning for Robotic Navigation in Unstructured Outdoor Environments |
|
Zhang, Jianfeng | East China Normal University |
Dong, Hanlin | East China Normal University |
Yang, Jian | Information Engineering University |
Liu, Jiahui | Fujian Normal University |
Huang, Shibo | East China Normal University |
Li, Ke | Information Engineering University |
Tang, Xuan | East China Normal University |
Wei, Xian | East China Normal University |
You, Xiong | Information Engineering University |
Keywords: Integrated Planning and Learning, Imitation Learning, Vision-Based Navigation
Abstract: Path planning with strong environmental adaptability plays a crucial role in robotic navigation in unstructured outdoor environments, especially in the case of low-quality location and map information. The path planning ability of a robot depends on the identification of the traversability of global and local ground areas. In real-world scenarios, the complexity of outdoor open environments makes it difficult for robots to identify the traversability of ground areas that lack a clearly defined structure. Moreover, most existing methods have rarely analyzed the integration of local and global traversability identifications in unstructured outdoor scenarios. To address this problem, we propose a novel method, Dual-BEV Nav, first introducing Bird’s Eye View (BEV) representations into local planning to generate high-quality traversable paths. Then, these paths are projected into the global traversability probability map generated by the global BEV planning model to obtain the optimal path. By integrating the traversability from both local and global BEV, we establish a dual-layer BEV heuristic planning paradigm, enabling long-distance navigation in unstructured outdoor environments. We test our approach through both public dataset evaluations and real-world robot deployments, yielding promising results. Compared to baselines, Dual-BEV Nav improved temporal distance prediction accuracy by up to 18.26%. In the real-world deployment, under conditions significantly different from the training set and with notable occlusions in the global BEV, Dual-BEV Nav successfully completed a 65-meter outdoor navigation run. Further analysis demonstrates that the local BEV representation significantly enhances the rationality of the planning, while the global BEV probability map ensures the robustness of the overall plan.
|
|
15:45-15:50, Paper WeDT9.7 | |
Risk-Aware Integrated Task and Motion Planning for Versatile Snake Robots under Localization Failures |
|
M. Jasour, Ashkan | MIT |
Daddi, Guglielmo | Politecnico Di Torino |
Endo, Masafumi | Keio University |
Vaquero, Tiago | JPL, Caltech |
Paton, Michael | Jet Propulsion Laboratory |
Strub, Marlin Polo | NASA Jet Propulsion Laboratory |
Corpino, Sabrina | Politecnico Di Torino |
Ingham, Michel | NASA-JPL |
Ono, Masahiro | California Institute of Technology |
Thakker, Rohan | NASA's Jet Propulsion Laboratory, Caltech |
Keywords: Planning under Uncertainty, Task and Motion Planning, Biologically-Inspired Robots
Abstract: Snake robots enable mobility through extreme terrains and confined environments in terrestrial and space applications. However, robust perception and localization for snake robots remain an open challenge due to the proximity of the sensor payload to the ground coupled with a limited field of view. To address this issue, we propose Blind-motion with Intermittently Scheduled Scans (BLISS) which combines proprioception-only mobility with intermittent scans to be resilient against both localization failures and collision risks. BLISS is formulated as an integrated task and motion planning (TAMP) problem that leads to a chance-constrained hybrid partially observable Markov decision process (CC-HPOMDP), known to be computationally intractable due to the curse of history. Our novelty lies in reformulating CC-HPOMDP as a tractable, convex mixed integer linear program. This allows us to solve BLISS-TAMP significantly faster and jointly derive optimal task-motion plans. Simulations and hardware experiments on the EELS snake robot show our method achieves over an order of magnitude computational improvement compared to state-of-the-art POMDP planners and > 50% better navigation time optimality versus classical two-stage planners.
|
|
WeDT10 |
313 |
Multi-Robot Systems 2 |
Regular Session |
Co-Chair: Bhattacharya, Sourabh | Iowa State University |
|
15:15-15:20, Paper WeDT10.1 | |
Formation Rotation and Assignment: Avoiding Obstacles in Multi-Robot Scenarios |
|
Zhang, Zhan | Northwestern Polytechnical University |
Li, Yan | Northwestern Polytechnical University |
Gu, Zhiyang | School of Automation, Northwestern Polytechnical University |
Wang, Zhong | Northwestern Polytechnical University |
Keywords: Multi-Robot Systems, Cooperating Robots, Collision Avoidance
Abstract: Current formation assignment and optimization methods frequently overlook the influence of rotational dynamics, limiting their operational flexibility. Additionally, these methods typically neglect the impact of obstacles, which may also hinder their effectiveness in obstacle-rich environments. To address these limitations, this paper proposes a novel approach that incorporates both rotation and assignment into the formation optimization of multi-robot systems. This approach allows for dynamic adjustment of the formation orientation and introduces a collaborative obstacle avoidance strategy. This strategy is specifically designed to assess and integrate the influence of obstacles into the optimization process, thereby enhancing the ability to maneuver around obstacles. Simulation experiments, including scenarios involving the encirclement of stationary and moving targets, validate the effectiveness of the proposed algorithm. The proposed algorithm outperforms non-rotational methods in maintaining formations under the influence of various types of obstacles while encircling targets. Furthermore, real-world flight experiments demonstrate the robustness and feasibility of the algorithm.
|
|
15:20-15:25, Paper WeDT10.2 | |
A Streamlined Heuristic for the Problem of Min-Time Coverage in Constricted Environments (I) |
|
Kim, Young-In | ISyE, Georgia Tech |
Reveliotis, Spiridon | Georgia Institute of Technology |
Keywords: Robotics in Hazardous Fields, Planning, Scheduling and Coordination, Optimization and Optimal Control
Abstract: The problem of min-time coverage in constricted environments concerns the employment of robotic fleets to support routine inspection and service operations within well-structured but constricted environments. In our previous work we have provided a detailed definition of this problem, specifying the objectives and the constraints involved, a Mixed Integer Programming (MIP) formulation for it, a formal analysis of its worst-case computational complexity, and additional structural properties of the optimal solutions that enable a partial relaxation of the original MIP formulation which preserves optimal performance. We have further employed these structural results towards the development of a construction heuristic for this problem. But while the worst-case computational complexity of the construction heuristic is polynomial with respect to the size of the problem-defining elements, its practical scalability has been limited by the requirement to formulate and solve a large number of linear programming formulations. In order to address this issue, this work presents a modified version of the heuristic that significantly reduces the computational times involved. Furthermore, we develop a local search method that further improves the solution obtained from the modified heuristic.
|
|
15:25-15:30, Paper WeDT10.3 | |
Scalable Multi-Agent Surveillance: A Kernel-Based Approach |
|
Mandal, Shashwata | Iowa State University |
Bhattacharya, Sourabh | Iowa State University |
Keywords: Motion and Path Planning, Computational Geometry, Multi-Robot Systems
Abstract: In this work, we address the deployment problem for a team of mobile guards that tries to maintain a line-of-sight with an unpredictable mobile intruder. First, we present a computationally efficient strategy for generating a set of points, called 'kernel points', that covers the entire polygon. We then introduce a polygon partitioning technique based on the location of the kernel points. Next, we propose control laws for a free guard to track an intruder in general polygonal environments based on the analysis of a pursuit-evasion game around a single corner. Finally, we present several variations of the proposed control laws that include capture and search, and illustrate the improvement in the overall visual footprint of the team of mobile guards based on extensive simulations.
|
|
15:30-15:35, Paper WeDT10.4 | |
Contingency Formation Planning for Interactive Drone Light Shows |
|
Au, Tsz-Chiu | Texas State University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Multi-Robot Systems, Swarm Robotics
Abstract: One of the most appealing applications of drone swarms is drone light shows, in which a group of drones displays an animation by showing a sequence of light patterns in the sky. In this paper, we consider using drone swarms as video game platforms and utilize planning techniques to display pixels in animations correctly while providing a fast response to user inputs. We devise a new sampling algorithm to solve a contingency formation planning problem, which aims to find a contingency formation plan such that drones can always move to the correct positions to display every possible future frame regardless of the user inputs in the future. The algorithm provides interactivity by preemptively relocating hidden drones, which move in stealth mode to the locations of all possible future frames. Our experiments show that the size of the frame buffer and the ratio between the number of drones and the number of pixels can greatly affect the performance of our system.
|
|
15:35-15:40, Paper WeDT10.5 | |
Design of a Formation Control System to Assist Human Operators in Flying a Swarm of Robotic Blimps |
|
Wu, Tianfu | Hong Kong University of Science and Technology |
Fu, Jiaqi | Beijing Jiaotong University |
Meng, Wugang | Hong Kong University of Science and Technology |
Cho, Sungjin | Sunchon National University |
Zhan, Huanzhe | Emory University |
Zhang, Fumin | Hong Kong University of Science and Technology |
Keywords: Swarm Robotics, Aerial Systems: Applications, Autonomous Vehicle Navigation
Abstract: Formation control is essential for swarm robotics, enabling coordinated behavior in complex environments. In this paper, we introduce a novel formation control system for an indoor blimp swarm using a specialized leader-follower approach enhanced with a dynamic leader-switching mechanism. This strategy allows any blimp to take on the leader role, distributing maneuvering demands across the swarm and enhancing overall formation stability. Only the leader blimp is manually controlled by a human operator, while follower blimps use onboard monocular cameras and a laser altimeter for relative position and altitude estimation. A leader-switching scheme is proposed to assist the human operator in maintaining the stability of the swarm, especially when a sharp turn is performed. Experimental results confirm that the leader-switching mechanism effectively maintains stable formations and adapts to dynamic indoor environments while assisting the human operator.
|
|
15:40-15:45, Paper WeDT10.6 | |
Multi-Agent Exploration with Similarity Score Map and Topological Memory |
|
Lee, Eun Sun | Seoul National University |
Kim, Young Min | Seoul National University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Vision-Based Navigation, Multi-Robot Systems
Abstract: Multi-robot exploration can be a collaborative solution for navigating a large-scale area. However, it is not trivial to optimally assign tasks among agents because the state dynamically changes while the local observations of multiple agents concurrently update the global map. Furthermore, the individual robots may not have access to accurate relative poses of others or global layouts. We propose an efficient spatial abstraction for multi-agent exploration based on topological graph memories. Each agent creates a topological graph, a lightweight spatial representation whose nodes contain minimal image features. The information in graphs is aggregated to compare individual nodes and is used to update the similarity scores in real-time. Then, the agents effectively fulfill distributed task goals by examining the dynamic similarity scores of frontier nodes. We further exploit extracted visual features to refine the relative poses among topological graphs. Our proposed pipeline can efficiently explore large-scale areas among various scene and robot configurations without sharing precise geometric information.
|
|
15:45-15:50, Paper WeDT10.7 | |
DREAM: Decentralized Real-Time Asynchronous Probabilistic Trajectory Planning for Collision-Free Multi-Robot Navigation in Cluttered Environments |
|
Şenbaşlar, Baskın | NVIDIA |
Sukhatme, Gaurav | University of Southern California |
Keywords: Collision Avoidance, Multi-Robot Systems, Motion and Path Planning, Probabilistic Trajectory Planning
Abstract: Collision-free navigation in cluttered environments with static and dynamic obstacles is essential for many multi-robot tasks. Dynamic obstacles may also be interactive, i.e., their behavior varies based on the behavior of other entities. We propose a novel representation for interactive behavior of dynamic obstacles and a decentralized real-time multi-robot trajectory planning algorithm allowing inter-robot collision avoidance as well as static and dynamic obstacle avoidance. Our planner simulates the behavior of dynamic obstacles, accounting for interactivity. We account for the perception inaccuracy of static and prediction inaccuracy of dynamic obstacles. We handle asynchronous planning between teammates and message delays, drops, and re-orderings. We evaluate our algorithm in simulations using 25400 random cases and compare it against three state-of-the-art baselines using 2100 random cases. Our algorithm achieves up to 1.68x success rate using as low as 0.28x time in single-robot, and up to 2.15x success rate using as low as 0.36x time in multi-robot cases compared to the best baseline. We implement our planner on real quadrotors to show its real-world applicability.
|
|
WeDT11 |
314 |
Foundation Models for Manipulation |
Regular Session |
Chair: Rivera, Corban | Johns Hopkins University Applied Physics Lab |
Co-Chair: Kober, Jens | TU Delft |
|
15:15-15:20, Paper WeDT11.1 | |
Enhancing the LLM-Based Robot Manipulation through Human-Robot Collaboration |
|
Liu, Haokun | The University of Tokyo |
Zhu, Yaonan | University of Tokyo |
Kato, Kenji | National Center for Geriatrics and Gerontology |
Tsukahara, Atsushi | Shinshu University |
Kondo, Izumi | National Center for Geriatrics and Gerontology |
Aoyama, Tadayoshi | Nagoya University |
Hasegawa, Yasuhisa | Nagoya University |
Keywords: AI-Enabled Robotics, Human-Robot Collaboration
Abstract: Large Language Models (LLMs) are gaining popularity in the field of robotics. However, LLM-based robots are limited to simple, repetitive motions due to the poor integration between language models, robots, and the environment. This paper proposes a novel approach to enhance the performance of LLM-based autonomous manipulation through Human-Robot Collaboration (HRC). The approach involves using a prompted GPT-4 language model to decompose high-level language commands into sequences of motions that can be executed by the robot. The system also employs a YOLO-based perception algorithm, providing visual cues to the LLM, which aids in planning feasible motions within the specific environment. Additionally, an HRC method is proposed by combining teleoperation and Dynamic Movement Primitives (DMP), allowing the LLM-based robot to learn from human guidance. Real-world experiments have been conducted using the Toyota Human Support Robot for manipulation tasks. The outcomes indicate that tasks requiring complex trajectory planning and reasoning over environments can be efficiently accomplished through the incorporation of human demonstrations.
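A minimal sketch of the kind of command decomposition pipeline described above: an LLM is prompted with a primitive library and perception output, and replies with a sequence of primitive calls. The primitive set, prompt wording, and the `call_llm` client are illustrative assumptions, not the authors' prompt or system.

PRIMITIVES = ["move_to(obj)", "grasp(obj)", "place(obj, location)", "handover()"]

def build_prompt(command, detections):
    # Compose a decomposition prompt from the high-level command and the
    # objects reported by the perception module.
    return (
        "You control a mobile manipulator. Available motion primitives:\n"
        + "\n".join(f"- {p}" for p in PRIMITIVES)
        + f"\nObjects visible (from the perception module): {', '.join(detections)}\n"
        + f"Task: {command}\n"
        + "Reply with one primitive call per line."
    )

def plan(command, detections, call_llm):
    # `call_llm` is any text-in/text-out LLM client supplied by the user.
    reply = call_llm(build_prompt(command, detections))
    return [line.strip() for line in reply.splitlines() if line.strip()]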
|
|
15:20-15:25, Paper WeDT11.2 | |
In-Context Learning Enables Robot Action Prediction in LLMs |
|
Yin, Yida | University of California, Berkeley |
Wang, Zekai | University of California, Berkeley |
Sharma, Yuvan | University of California, Berkeley |
Niu, Dantong | University of California, Berkeley |
Darrell, Trevor | UC Berkeley |
Herzig, Roei | Tel Aviv University |
Keywords: Deep Learning Methods, Learning from Demonstration
Abstract: Recently, Large Language Models (LLMs) have achieved remarkable success using in-context learning (ICL) in the language domain. However, leveraging the ICL capabilities within off-the-shelf LLMs to directly predict robot actions remains largely unexplored. In this paper, we introduce RobotPrompt, a framework that enables off-the-shelf text-only LLMs to directly predict robot actions through ICL without training. Our approach first heuristically identifies keyframes that capture important moments from an episode. Next, we extract end-effector actions from these keyframes as well as the estimated initial object poses, and both are converted into textual descriptions. Finally, we construct a structured template to form ICL demonstrations from these textual descriptions and a task instruction. This enables an LLM to directly predict robot actions at test time. Through extensive experiments and analysis, RobotPrompt shows stronger performance over zero-shot and ICL baselines in simulated and real-world settings.
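A minimal sketch of building textual in-context demonstrations from keyframe end-effector actions and initial object poses, as outlined above; the demonstration format and field names are illustrative assumptions, not the paper's template.

def format_demo(keyframe_actions, object_pose):
    # Turn keyframe end-effector actions and an initial object pose into a
    # textual demonstration for in-context learning.
    lines = [f"object pose: {object_pose}"]
    for i, (xyz, gripper_closed) in enumerate(keyframe_actions):
        lines.append(f"step {i}: move to {xyz}, gripper {'close' if gripper_closed else 'open'}")
    return "\n".join(lines)

def build_icl_prompt(demos, task_instruction, current_object_pose):
    # Stack several demonstrations, then ask the LLM to predict actions for the new pose.
    prompt = f"Task: {task_instruction}\n\n"
    for d in demos:
        prompt += "Demonstration:\n" + d + "\n\n"
    prompt += f"Now, object pose: {current_object_pose}\nPredict the next steps:"
    return prompt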
|
|
15:25-15:30, Paper WeDT11.3 | |
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models |
|
Yu, Qiaojun | Shanghai Jiao Tong University |
Huang, Siyuan | Shanghai Jiao Tong University |
Yuan, Xibin | Shanghai Jiao Tong University |
Jiang, Zhengkai | Tencent |
Hao, Ce | University of California, Berkeley |
Li, Xin | Shanghai Jiao Tong University |
Chang, Haonan | Rutgers University |
Wang, Junbo | Shanghai Jiao Tong University |
Liu, Liu | Hefei University of Technology |
Li, Hongsheng | Chinese University of Hong Kong |
Gao, Peng | Shanghai AI Lab |
Lu, Cewu | Shanghai Jiao Tong University |
Keywords: Perception for Grasping and Manipulation, Deep Learning for Visual Perception, Deep Learning in Grasping and Manipulation
Abstract: Previous studies on robotic manipulation are based on a limited understanding of the underlying 3D motion constraints and affordances. To address these challenges, we propose a comprehensive paradigm, termed UniAff, that integrates 3D object-centric manipulation and task understanding in a unified formulation. Specifically, we constructed a dataset labeled with manipulation-related key attributes, comprising 900 articulated objects from 19 categories and 600 tools from 12 categories. Furthermore, we leverage MLLMs to infer object-centric representations for manipulation tasks, including affordance recognition and reasoning about 3D motion constraints. Comprehensive experiments in both simulation and real-world settings indicate that UniAff significantly improves the generalization of robotic manipulation for tools and articulated objects. We hope that UniAff will serve as a general baseline for unified robotic manipulation tasks in the future. Images, videos, dataset and code are published on the project website at: https://sites.google.com/view/uni-aff/home.
|
|
15:30-15:35, Paper WeDT11.4 | |
ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution |
|
Rivera, Corban | Johns Hopkins University Applied Physics Lab |
Byrd, Grayson | Johns Hopkins University |
Paul, William | Johns Hopkins University Applied Physics Lab |
Feldman, Tyler | Johns Hopkins University Applied Physics Laboratory |
Booker, Meghan | Princeton University |
Holmes, Emma | Johns Hopkins University Applied Physics Lab |
Handelman, David | American Android Corp |
Kemp, Bethany | Johns Hopkins Applied Physics Laboratory |
Badger, Andrew | JHUAPL |
Schmidt, Aurora | Johns Hopkins University Applied Physic Laboratory |
Jatavallabhula, Krishna Murthy | MIT |
de Melo, Celso | CCDC US Army Research Laboratory |
Seenivasan, Lalithkumar | Johns Hopkins University |
Unberath, Mathias | Johns Hopkins University |
Chellappa, Rama | Johns Hopkins University |
Keywords: Agent-Based Systems, Mobile Manipulation
Abstract: Robotic planning and execution in open-world environments is a complex problem due to the vast state spaces and high variability of task embodiment. Recent advances in perception algorithms, combined with Large Language Models (LLMs) for planning, offer promising solutions to these challenges, as the common sense reasoning capabilities of LLMs provide a strong heuristic for efficiently searching the action space. However, prior work fails to address the possibility of hallucinations from LLMs, which results in failures to execute the planned actions largely due to logical fallacies at high or low levels. To contend with automation failure due to such hallucinations, we introduce ConceptAgent, a natural language-driven robotic platform designed for task execution in unstructured environments. With a focus on scalability and reliability of LLM-based planning in complex state and action spaces, we present innovations designed to limit these shortcomings, including 1) Predicate Grounding to prevent and recover from infeasible actions, and 2) an embodied version of LLM-guided Monte Carlo Tree Search with self-reflection. ConceptAgent combines these planning enhancements with dynamic language-aligned 3D scene graphs and large multi-modal pretrained models to perceive, localize, and interact with its environment, enabling reliable task completion. In simulation experiments, ConceptAgent achieved a 19% task completion rate across three room layouts and 30 easy-level embodied tasks, outperforming other state-of-the-art LLM-driven reasoning baselines that scored 10.26% and 8.11% on the same benchmark. Additionally, ablation studies on moderate to hard embodied tasks revealed a 20% increase in task completion from the baseline agent to the fully enhanced ConceptAgent, highlighting the individual and combined contributions of Predicate Grounding and LLM-guided Tree Search to enable more robust automation in complex state and action spaces.
|
|
15:35-15:40, Paper WeDT11.5 | |
Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-Guided 3D Policy |
|
Garcia-Pinel, Ricardo | Inria |
Chen, Shizhe | Inria |
Schmid, Cordelia | Inria |
Keywords: Grippers and Other End-Effectors, Software Tools for Benchmarking and Reproducibility, Deep Learning Methods
Abstract: Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. Code, dataset, real robot videos and trained models are available at https://www.di.ens.fr/willow/research/gembench/.
|
|
15:40-15:45, Paper WeDT11.6 | |
Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs |
|
Mavrogiannis, Angelos | University of Maryland, College Park |
Yuan, Dehao | University of Maryland, College Park |
Aloimonos, Yiannis | University of Maryland |
Keywords: AI-Based Methods, Computer Architecture for Robotic and Automation, Software, Middleware and Programming Environments
Abstract: There has been a lot of interest in grounding natural language to physical entities through visual context. While Vision Language Models (VLMs) can ground linguistic instructions to visual sensory information, they struggle with grounding non-visual attributes, like the weight of an object. Our key insight is that non-visual attribute detection can be effectively achieved by active perception guided by visual reasoning. To this end, we present a perception-action API that consists of VLMs and Large Language Models (LLMs) as backbones, together with a set of robot control functions. When prompted with this API and a natural language query, an LLM generates a program to actively identify attributes given an input image. Offline testing on the Odd-One-Out dataset demonstrates that our framework outperforms vanilla VLMs in detecting attributes like relative object location, size, and weight. Online testing in realistic household scenes on AI2-THOR and a real robot demonstration on a DJI RoboMaster EP robot highlight the efficacy of our approach.
|
|
15:45-15:50, Paper WeDT11.7 | |
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models |
|
Ma, Runyu | TU Delft |
Luijkx, Jelle Douwe | Delft University of Technology |
Ajanovic, Zlatan | RWTH Aachen University |
Kober, Jens | TU Delft |
Keywords: AI-Based Methods, Reinforcement Learning
Abstract: In robot manipulation, Reinforcement Learning (RL) often suffers from low sample efficiency and uncertain convergence, especially in large observation and action spaces. Foundation Models (FMs) offer an alternative, demonstrating promise in zero-shot and few-shot settings. However, they can be unreliable due to limited physical and spatial understanding. We introduce ExploRLLM, a method that combines the strengths of both paradigms. In our approach, FMs improve RL convergence by generating policy code and efficient representations, while a residual RL agent compensates for the FMs' limited physical understanding. We show that ExploRLLM outperforms both policies derived from FMs and RL baselines in table-top manipulation tasks. Additionally, real-world experiments show that the policies exhibit promising zero-shot sim-to-real transfer. Supplementary material is available at https://explorllm.github.io.
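A minimal sketch of the residual idea described above: a foundation-model action is corrected by a small learned residual. The scaling factor, clipping range, and action dimensionality are illustrative assumptions, not the paper's values.

import numpy as np

def combined_action(fm_action, residual_policy, obs, scale=0.2):
    # Add a scaled learned residual to the foundation-model action and clip
    # to a normalized action range.
    residual = residual_policy(obs)
    combined = np.asarray(fm_action, dtype=float) + scale * np.asarray(residual, dtype=float)
    return np.clip(combined, -1.0, 1.0)

# Stand-in residual policy for a 2-D action space.
residual_policy = lambda obs: [0.05, -0.02]
print(combined_action([0.4, 0.1], residual_policy, obs=None))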
|
|
WeDT12 |
315 |
Robotics and Automation in Construction and Industry |
Regular Session |
Co-Chair: Werfel, Justin | Harvard University |
|
15:15-15:20, Paper WeDT12.1 | |
Physical Simulation with Force Feedback Aids Robot Factors Design |
|
Kaeser, Carina | Student |
Melenbrink, Nathan | Harvard University |
Karp, Allison | Harvard |
Werfel, Justin | Harvard University |
Keywords: Product Design, Development and Prototyping, Space Robotics and Automation, Haptics and Haptic Interfaces
Abstract: "Robot factors" design, analogous to ergonomics for humans, seeks to create devices and equipment that can be readily operated by robots, by considering typical capabilities of current robots throughout the design process. While a number of principles and heuristics for robot factors design have been identified, the successful design of hardware operable by autonomous robots often depends in practice on the designer's intuition about robot capabilities, developed through personal experience working with robots. Here we present a tool we have developed to help evaluate a potential device design for usability by a robot, by allowing a designer to in effect teleoperate a virtual robot and attempt the operation of the device. The tool uses a 3D physics-based simulation built in Unity, and a Phantom Omni / Geomagic Touch haptic device that controls the virtual robot's end-effector and provides force feedback. Through user studies, we show that the use of this tool can significantly improve a user's estimation of the suitability of a design for robot operation, in two case studies involving replacing a unit in a modular hardware system and unzipping a canvas bag. By incorporating the use of such a tool early in the design cycle, designers can more effectively develop equipment to be used by autonomous robots without themselves needing direct robotics experience; as a result, robots will be able to take on more tasks in the nearer term with current robot technology.
|
|
15:20-15:25, Paper WeDT12.2 | |
Environmental Map Learning with Multiple-Robots |
|
Shamshirgaran, Azin | University of California, Merced |
Carpin, Stefano | University of California, Merced |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Multi-Robot Systems
Abstract: This paper explores decision-making processes in robotic systems tasked with reconstructing scalar fields through sensing in uncertain environments. Each robot must handle noisy perception and operate within specific environmental and physical constraints. The complexity increases in multi-agent scenarios, where robots must not only plan their actions but also anticipate the movements and strategies of other agents. Effective coordination is crucial to prevent collisions and minimize redundant tasks. To address this challenge, we propose an online, distributed multi-robot sampling algorithm that combines Monte Carlo Tree Search (MCTS) with Gaussian regression. In this approach, each robot iteratively selects its next sampling point while exchanging limited information with other robots and predicting their future actions. Predictions about other robots' future actions are computed with an MCTS that is recomputed at each iteration to incorporate all information collected up to that point. We evaluate the performance of our method across diverse environments and team sizes, comparing it to algorithmic alternatives.
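A minimal sketch of one informative-sampling step: fit a Gaussian process to the samples collected so far, then pick the candidate point with the highest predictive uncertainty that is not too close to points other robots are predicted to visit. A greedy selection is used here as a stand-in for the paper's MCTS, and the separation threshold is an illustrative assumption.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def next_sample_point(X_obs, y_obs, candidates, teammates_predicted, min_sep=1.0):
    # Greedy uncertainty-driven selection with a simple redundancy penalty.
    gp = GaussianProcessRegressor().fit(X_obs, y_obs)
    _, std = gp.predict(candidates, return_std=True)
    for other in teammates_predicted:
        too_close = np.linalg.norm(candidates - other, axis=1) < min_sep
        std[too_close] = 0.0                      # discourage sampling where a teammate will go
    return candidates[int(np.argmax(std))]

X_obs = np.array([[0.0, 0.0], [1.0, 1.0]])
y_obs = np.array([0.2, 0.8])
cands = np.array([[0.5, 0.5], [2.0, 2.0], [0.9, 1.1]])
print(next_sample_point(X_obs, y_obs, cands, teammates_predicted=[np.array([2.0, 2.0])]))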
|
|
15:25-15:30, Paper WeDT12.3 | |
SLABIM: A SLAM-BIM Coupled Dataset in HKUST Main Building |
|
Huang, Haoming | The Hong Kong University of Science and Technology |
Qiao, Zhijian | Hong Kong University of Science and Technology |
Yu, Zehuan | Hong Kong University of Science and Technology |
Liu, Chuhao | Hong Kong University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Zhang, Fumin | Hong Kong University of Science and Technology |
Yin, Huan | Hong Kong University of Science and Technology |
Keywords: Robotics and Automation in Construction, Data Sets for SLAM, Data Sets for Robotic Vision
Abstract: Existing indoor SLAM datasets primarily focus on robot sensing, often lacking building architectures. To address this gap, we design and construct the first dataset to couple SLAM and BIM, named SLABIM. This dataset provides BIM and SLAM-oriented sensor data, both modeling a university building at HKUST. The as-designed BIM is decomposed and converted for ease of use. We employ a multi-sensor suite for multi-session data collection and mapping to obtain the as-built model. All the related data are timestamped and organized, enabling users to deploy and test effectively. Furthermore, we deploy advanced methods and report the experimental results on three tasks: registration, localization and semantic mapping, demonstrating the effectiveness and practicality of SLABIM. We make our dataset open-source at https://github.com/HKUST-Aerial-Robotics/SLABIM.
|
|
15:30-15:35, Paper WeDT12.4 | |
Unified Adaptive and Cooperative Planning Using Multi-Task Coregionalized Gaussian Processes |
|
Booth, Lorenzo A. | University of California Merced |
Carpin, Stefano | University of California, Merced |
Keywords: Robotics and Automation in Agriculture and Forestry, Agricultural Automation, Planning, Scheduling and Coordination
Abstract: For robots tasked with surveying the temporal dynamics of a changing environment, a choice must be made to observe novel regions of the environment or to re-survey previously visited regions, which may have changed. We present a novel multi-robot informative path planner (IPP) that combines an environmental and task kernel to direct mobile robots to gather samples from regions that would result in the greatest expected improvement in map accuracy. Our planner utilizes a multi-output Gaussian process to unify priors about the spatiotemporal environment along with priors about observational correlations between sensing vehicles. Additionally, we extend our analysis into an adaptive planning scenario and examine the performance under different planning configurations. We find that planning performance is largely driven by the choice of environmental priors, and that unrepresentative priors can be improved through adaptive planning.
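A minimal sketch of a multi-output (coregionalized) Gaussian process covariance of the kind described above: a task covariance B over the sensing vehicles is combined with one shared spatial kernel, K_joint = B ⊗ k(X, X). The kernel choice, lengthscale, and task covariance values are illustrative assumptions, not the paper's exact model.

import numpy as np

def rbf(X, Y, lengthscale=1.0):
    # Squared-exponential spatial kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def coregionalized_cov(X, B, lengthscale=1.0):
    # Intrinsic coregionalization model: joint covariance over correlated outputs.
    return np.kron(B, rbf(X, X, lengthscale))

# Two correlated outputs (e.g., two sensing vehicles), three spatial points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.8], [0.8, 1.0]])
K = coregionalized_cov(X, B)
print(K.shape)  # (6, 6) joint covariance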
|
|
15:35-15:40, Paper WeDT12.5 | |
COIGAN: Controllable Object Inpainting through Generative Adversarial Network for Defect Synthesis in Data Augmentation |
|
Biancucci, Massimiliano | Università Politecnica Delle Marche |
Galdelli, Alessandro | Università Politecnica Delle Marche |
Narang, Gagan | Università Politecnica Delle Marche |
Pietrini, Rocco | Università Politecnica Delle Marche |
Mancini, Adriano | Università Politecnica Delle Marche |
Zingaretti, Primo | Università Politecnica Delle Marche |
Keywords: Robotics and Automation in Construction, AI-Enabled Robotics, Deep Learning Methods
Abstract: Predictive maintenance is a key aspect for the safety of critical infrastructure such as bridges, dams, and tunnels, where a failure can lead to catastrophic outcomes in terms of human lives and costs. The surge in Artificial Intelligence-driven visual robotic inspection methods necessitates high-quality datasets containing diverse defect classes with several instances under different conditions (e.g., material, illumination). In this context, we introduce a Controllable Object Inpainting Generative Adversarial Network (COIGAN) to synthetically generate realistic images that augment defect datasets. The effectiveness of the model is quantitatively validated by a Fréchet Inception Distance, which measures the similarity between the generated and training samples. To further evaluate the impact of COIGAN-generated images, a segmentation task was conducted, utilizing key performance metrics such as segmentation accuracy, mAP, mIoU, and F1 score, demonstrating that the synthetic images integrate seamlessly and produce results comparable to real defect images. Subsequently, COIGAN's generative capability was successfully used for the segmentation of a defect-free dataset by inpainting defects. The results showcase COIGAN's ability to learn defect patterns and apply them in new contexts, preserving the original features of the base image and allowing the creation of new datasets with a desired multi-class distribution. Specifically, in the context of predictive maintenance, COIGAN enriches datasets, enabling deep learning models to more effectively identify potential infrastructure anomalies. Project page: https://bit.ly/4bzxwqf.
|
|
15:40-15:45, Paper WeDT12.6 | |
Diffusion Based Robust LiDAR Place Recognition |
|
Krummenacher, Benjamin | ETH Zurich |
Frey, Jonas | ETH Zurich |
Tuna, Turcan | ETH Zurich, Robotic Systems Lab |
Vysotska, Olga | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Robotics and Automation in Construction, Localization
Abstract: Mobile robots on construction sites require accurate pose estimation to perform autonomous surveying and inspection missions. Localization in construction sites is a particularly challenging problem due to the presence of repetitive features such as flat plastered walls and perceptual aliasing due to apartments with similar layouts across and within floors. In this paper, we focus on global re-positioning of a robot with respect to an accurate scanned mesh of the building solely using LiDAR data. In our approach, a neural network is trained on synthetic LiDAR point clouds generated by simulating a LiDAR in an accurate real-life large-scale mesh. We train a diffusion model with a PointNet++ backbone, which allows us to model multiple position candidates from a single LiDAR point cloud. The resulting model can successfully predict the global position of the LiDAR in confined and complex sites despite the adverse effects of perceptual aliasing. The learned distribution over potential global positions can provide a multi-modal position estimate. We evaluate our approach across five real-world datasets and show an average place recognition accuracy of 77% (2 m threshold) while outperforming baselines by a factor of 2 in mean error.
|
|
15:45-15:50, Paper WeDT12.7 | |
Enhancing Robotic Precision in Construction: A Modular Factor Graph-Based Framework to Deflection and Backlash Compensation Using High-Accuracy Accelerometers |
|
Kindle, Julien | ETH Zurich |
Loetscher, Michael | ETH Zurich, Hilti |
Alessandretti, Andrea | Hilti Group |
Cadena, Cesar | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Robotics and Automation in Construction, Localization, Sensor Fusion
Abstract: Accurate positioning is crucial in the construction industry, where labor shortages highlight the need for automation. Robotic systems with long kinematic chains are required to reach complex workspaces, including floors, walls, and ceilings. These requirements significantly impact positioning accuracy due to effects such as deflection and backlash in various parts along the kinematic chain. In this letter, we introduce a novel approach that integrates deflection and backlash compensation models with high-accuracy accelerometers, significantly enhancing position accuracy. Our method employs a modular framework based on a factor graph formulation to estimate the state of the kinematic chain, leveraging acceleration measurements to inform the model. Extensive testing on publicly released datasets, reflecting real-world construction disturbances, demonstrates the advantages of our approach. The proposed method reduces the 95% error threshold in the xy-plane by 50% compared to the state-of-the-art Virtual Joint Method, and by 31% when incorporating base tilt compensation.
|
|
WeDT13 |
316 |
Manipulation and Locomotion Using Magnetic Fields |
Regular Session |
Chair: Tanner, Herbert G. | University of Delaware |
|
15:15-15:20, Paper WeDT13.1 | |
Open-Loop Position Control of a Miniature Magnetic Robot Using Two-Dimensional Divergence Control of a Magnetic Force |
|
Lee, Hakjoon | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Latifi Gharamaleki, Nader | DGIST |
Choi, Hongsoo | Daegu Gyeongbuk Institute of Science and Technology (DGIST) |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales, Actuation and Joint Mechanisms
Abstract: Miniature magnetic robots have attracted considerable attention as promising tools in biomedical applications due to their wireless actuation and precise controllability in a minimally invasive manner. Traditionally, magnetic microrobots have been controlled by globally applied magnetic torques and forces generated by external magnetic actuation systems (MASs), which typically require closed-loop control with real-time vision tracking—a challenging requirement in in-vivo environments. To address this issue, this paper suggests a novel open-loop control scheme for magnetic robots, using two-dimensional (2D) divergence control of a magnetic force generated by stationary electromagnets. Constraint equations for the currents applied to the electromagnets were established to achieve 2D divergence control of a magnetic force. Numerical simulation and experimental validations demonstrate that this approach can generate sufficient magnetic forces that either converge at or diverge from a target point, enabling effective open-loop position control of a miniature magnetic robot. Due to the absence of vision feedback and mechanical motions of magnets, the proposed control strategy could be more clinically applicable for medical applications of magnetic robots.
|
|
15:20-15:25, Paper WeDT13.2 | |
An Equilibrium Analysis of Magnetic Quadrupole Force Field with Applications to Microrobotic Swarm Coordination |
|
Faros, Ioannis | University of Delaware |
Tanner, Herbert G. | University of Delaware |
Keywords: Swarm Robotics, Planning, Scheduling and Coordination, Micro/Nano Robots
Abstract: Controlled microrobots in fluidic environments hold promise for precise drug delivery and cell manipulation, opening new ways for personalized healthcare. However, coordinating magnetic microrobot swarms presents significant challenges due to the complexity of the associated actuation mechanisms. While existing methods to achieve motion differentiation in collections of microrobots rely on design variations among them, the work reported here applies to homogeneous collectives and enables them to be steered as a whole or in fragments, by means of a common externally generated force field. This paper contributes to an emerging set of methods that enable swarm control through manipulation of these force fields. This paper in particular exploits the nature of force field equilibria in a quadrupole workspace configuration as a means of steering the swarm while maintaining its cohesion. The approach also enables splitting the swarm in two subgroups in order to direct each simultaneously to a different location.
|
|
15:25-15:30, Paper WeDT13.3 | |
Ensemble Control of a 2-DOF Parallel Link Arm in a Capsule Robot Using Oscillating External Magnetic Fields |
|
Zhao, Zihan | The University of Sheffield |
Hafez, Ahmed | University of Sheffield |
Miyashita, Shuhei | University of Sheffield |
Keywords: Medical Robots and Systems, Mechanism Design, Micro/Nano Robots
Abstract: Providing oral capsule robots with additional degrees of freedom (DOF), such as robotic arms, is crucial for enhancing their functionality within the body. However, a key challenge arises when using rotating magnetic fields to drive the motor within the robot, as the resulting torque causes the entire capsule to rotate. In this work, we propose a novel approach to actuate a 2 DOF parallel link robot arm integrated into a capsule robot, using external magnetic fields. Our method employs two identical magnetic motors we proposed in a previous study, each driven by an oscillating magnetic field, which alternates direction along a specific axis. By independently controlling the rotation of each motor through the same magnetic field, ensemble control is achieved. The symmetrically arranged motors exhibit different angular velocities, enabling dexterous movement of the robot arm. We further theoretically show that this approach significantly reduces the torque exerted on the robot compared to traditional rotating magnetic fields. Finally, we demonstrate the performance of the robot by moving its arms and the attached end-effector along a pre-defined trajectory.
|
|
15:30-15:35, Paper WeDT13.4 | |
Deep Reinforcement Learning-Based Semi-Autonomous Control for Magnetic Micro-Robot Navigation with Immersive Manipulation |
|
Mao, Yudong | Imperial College London |
Zhang, Dandan | Imperial College London |
Keywords: Micro/Nano Robots, Automation at Micro-Nano Scales
Abstract: Magnetic micro-robots have demonstrated immense potential in biomedical applications, such as in vivo drug delivery, non-invasive diagnostics, and cell-based therapies, owing to their precise maneuverability and small size. However, current micromanipulation techniques often rely solely on a two-dimensional (2D) microscopic view as sensory feedback, while traditional control interfaces do not provide an intuitive manner for operators to manipulate micro-robots. These limitations increase the cognitive load on operators, who must interpret limited feedback and translate it into effective control actions. To address these challenges, we propose a Deep Reinforcement Learning-Based Semi-Autonomous Control (DRL-SC) framework for magnetic micro-robot navigation in a simulated microvascular system. Our framework integrates Mixed Reality (MR) to facilitate immersive manipulation of micro-robots, thereby enhancing situational awareness and control precision. Simulation and experimental results demonstrate that our approach significantly improves navigation efficiency, reduces control errors, and enhances the overall robustness of the system in simulated microvascular environments.
|
|
15:35-15:40, Paper WeDT13.5 | |
OMASTAR: Optimal Magnetic Actuation System Arrangement |
|
Palanichamy, Veerash | McMaster University |
Saad, Hussein | McMaster University |
Giamou, Matthew | McMaster University |
Onaizah, Onaizah | McMaster University |
Keywords: Micro/Nano Robots, Surgical Robotics: Steerable Catheters/Needles, Optimization and Optimal Control
Abstract: Microrobots and other miniature robots are able to access millimeter-sized spaces and thus have the potential to solve many challenging problems in healthcare. However, clinical adoption of these robots is rare as these systems are often difficult to scale up. One such issue arises from the actuation systems used to remotely control magnetic microrobots, which tend to be bulky and obstruct the surgeons' workspaces. They also do not guarantee wide ranges of magnetic fields and forces in a large patient-sized workspace. In this paper, we present the design of a permanent magnet-based actuation system that fits within a 40 cm cube of space under an operating table. We also formulate a new set function maximization-based approach for efficiently designing E-optimal magnet arrangements with off-the-shelf convex solvers. Our optimization method is evaluated with synthetic data and a proof-of-concept of the system is simulated.
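A minimal sketch of E-optimal arrangement design via greedy set-function maximization, as a simplified stand-in for the convex-solver-based approach described above: each candidate magnet pose contributes a rank-one information term, and poses are added one at a time to maximize the smallest eigenvalue. The candidate representation and problem size are illustrative assumptions.

import numpy as np

def greedy_e_optimal(candidate_rows, k):
    # Greedily select k candidates maximizing the smallest eigenvalue of
    # A(S) = sum_i a_i a_i^T over the selected set S.
    d = candidate_rows.shape[1]
    selected = []
    remaining = list(range(len(candidate_rows)))
    A = np.zeros((d, d))
    for _ in range(k):
        best_j, best_val = None, -np.inf
        for j in remaining:
            a = candidate_rows[j][:, None]
            val = np.linalg.eigvalsh(A + a @ a.T)[0]   # smallest eigenvalue
            if val > best_val:
                best_j, best_val = j, val
        selected.append(best_j)
        remaining.remove(best_j)
        a = candidate_rows[best_j][:, None]
        A += a @ a.T
    return selected

rng = np.random.default_rng(1)
rows = rng.normal(size=(20, 3))   # candidate magnet "information" directions (illustrative)
print(greedy_e_optimal(rows, k=5))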
|
|
15:40-15:45, Paper WeDT13.6 | |
Measuring DNA Microswimmer Locomotion in Complex Flow Environments |
|
Imamura, Taryn | Carnegie Mellon University |
Kent, Teresa | Carnegie Mellon University |
Taylor, Rebecca | Carnegie Mellon University |
Bergbreiter, Sarah | Carnegie Mellon University |
Keywords: Micro/Nano Robots, Biologically-Inspired Robots, Automation at Micro-Nano Scales
Abstract: Microswimmers are sub-millimeter swimming robots that show potential as a platform for controllable locomotion in applications, including targeted cargo delivery and minimally invasive surgery. To be viable for these target applications, microswimmers will eventually need to be able to navigate environments with dynamic fluid flows and forces. Experimental studies with microswimmers towards this goal are currently rare because of the difficulty of isolating intentional microswimmer locomotion from environment-induced motion. In this work, we present a method for measuring microswimmer locomotion within a complex flow environment using fiducial microspheres. By tracking the particle motion of ferromagnetic and non-magnetic polystyrene fiducial microspheres, we capture the effect of fluid flow and magnetic field gradients on microswimmer trajectories. We then determine the field-driven translation of these microswimmers relative to fluid flow and demonstrate the effectiveness of this method by illustrating the motion of multiple microswimmers through different flows.
|
|
15:45-15:50, Paper WeDT13.7 | |
Position Regulation of a Conductive Nonmagnetic Object with Two Stationary Field Sources |
|
Dalton, Devin | University of Utah |
Tabor, Griffin | University of Utah |
Hermans, Tucker | University of Utah |
Abbott, Jake J. | University of Utah |
Keywords: Dexterous Manipulation, Force Control, Manipulation Planning, Space Robotics and Automation
Abstract: Recent research has shown that eddy currents induced by rotating magnetic dipole fields in conductive nonmagnetic objects can produce forces and torques that enable dexterous manipulation. This paradigm shows promise for application in the remediation of space debris. The induced force from each rotating-magnetic-dipole field source always includes a repulsive component, suggesting that the object should be surrounded by field sources to some degree to ensure the object does not leave the dexterous workspace during manipulation. In this paper, we show that it is possible to fully control the position of an object using just two stationary field sources, provided the object is near the midpoint between the field sources. A given position controller requires a low-level force controller. We propose two new force controllers and compare them with the state-of-the-art method from the literature. One of the new force controllers is particularly good at not inducing parasitic torques, which is hypothesized to be beneficial for future tasks manipulating rotating resident space objects. We perform experimental verification using numerical and physical simulators of microgravity.
|
|
WeDT14 |
402 |
Social Navigation 1 |
Regular Session |
Chair: Mavrogiannis, Christoforos | University of Michigan |
Co-Chair: Kästner, Linh | T-Mobile, TU Berlin |
|
15:15-15:20, Paper WeDT14.1 | |
From Cognition to Precognition: A Future-Aware Framework for Social Navigation |
|
Gong, Zeying | Hong Kong University of Science and Technology (Guangzhou) |
Hu, Tianshuai | The Hong Kong University of Science and Technology |
Qiu, Ronghe | The Hong Kong University of Science and Technology (Guangzhou) |
Liang, Junwei | HKUST (Guangzhou) |
Keywords: Vision-Based Navigation, Human-Aware Motion Planning
Abstract: To navigate safely and efficiently in crowded spaces, robots should not only perceive the current state of the environment but also anticipate future human movements. In this paper, we propose a reinforcement learning architecture, namely Falcon, to tackle socially-aware navigation by explicitly predicting human trajectories and penalizing actions that block future human paths. To facilitate realistic evaluation, we introduce a novel SocialNav benchmark containing two new datasets, Social-HM3D and Social-MP3D. This benchmark offers large-scale photo-realistic indoor scenes populated with a reasonable amount of human agents based on scene area size, incorporating natural human movements and trajectory patterns. We conduct a detailed experimental analysis with the state-of-the-art learning-based method and two classic rule-based path-planning algorithms on the new benchmark. The results demonstrate the importance of future prediction and our method achieves the best task success rate of 55% while maintaining about 90% personal space compliance. We will release our code and datasets.
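A minimal sketch of a reward term that penalizes robot actions blocking predicted future human paths, in the spirit of the approach described above; the clearance threshold and weight are illustrative assumptions, not the paper's reward design.

import numpy as np

def future_blocking_penalty(robot_next_pos, predicted_human_traj, clearance=0.5, weight=1.0):
    # Penalize a candidate robot position that comes within `clearance` of any
    # predicted future human waypoint.
    d = np.linalg.norm(
        np.asarray(predicted_human_traj, dtype=float) - np.asarray(robot_next_pos, dtype=float),
        axis=1,
    )
    return -weight * float(np.sum(np.maximum(0.0, clearance - d)))

# Predicted human path passes close to the candidate robot position (1.0, 1.0).
print(future_blocking_penalty([1.0, 1.0], [[0.5, 0.5], [0.9, 0.9], [1.3, 1.3]]))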
|
|
15:20-15:25, Paper WeDT14.2 | |
OLiVia-Nav: An Online Lifelong Vision Language Approach for Mobile Robot Social Navigation |
|
Narasimhan, Siddarth | University of Toronto |
Tan, Aaron Hao | University of Toronto |
Choi, Daniel | University of Toronto |
Nejat, Goldie | University of Toronto |
Keywords: Service Robotics, Human-Aware Motion Planning, Continual Learning
Abstract: Service robots in human-centered environments such as hospitals, office buildings, and long-term care homes need to navigate while adhering to social norms to ensure the safety and comfort of the people they are sharing the space with. Furthermore, they need to adapt to new social scenarios that can arise during robot navigation. In this paper, we present a novel Online Lifelong Vision Language architecture, OLiVia-Nav, which uniquely integrates vision-language models (VLMs) with an online lifelong learning framework for robot social navigation. We introduce a unique distillation approach, Social Context Contrastive Language Image Pre-training (SC-CLIP), to transfer the social reasoning capabilities of large VLMs to a lightweight VLM, in order for OLiVia-Nav to directly encode social and environment context during robot navigation. These encoded embeddings are used to generate and select socially compliant robot trajectories. The lifelong learning capabilities of SC-CLIP enable OLiVia-Nav to update the robot trajectory planning over time as new social scenarios are encountered. We conducted extensive real-world experiments in diverse social navigation scenarios. The results showed that OLiVia-Nav outperformed existing state-of-the-art DRL and VLM methods in terms of mean squared error, Hausdorff loss, and personal space violation duration. Ablation studies also verified the design choices for OLiVia-Nav.
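A minimal sketch of a CLIP-style contrastive distillation loss for aligning a lightweight student VLM's embeddings with those of a large teacher, in the spirit of the distillation step described above; this generic formulation and the temperature value are assumptions, not the paper's SC-CLIP loss.

import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, temperature=0.07):
    # Student embeddings should match the corresponding teacher embeddings
    # and differ from the other samples in the batch.
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

student = torch.randn(8, 512)
teacher = torch.randn(8, 512)
print(distillation_loss(student, teacher))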
|
|
15:25-15:30, Paper WeDT14.3 | |
Arena 4.0: A Comprehensive ROS2 Development and Benchmarking Platform for Human-Centric Navigation Using Generative-Model-Based Environment Generation |
|
Shcherbyna, Volodymyr | Technical University Berlin |
Kästner, Linh | T-Mobile, TU Berlin |
Diaz, Diego | Technical University Berlin |
Nguyen Huu Truong, Giang | Hanoi University of Science and Technology |
Schreff, Maximilian Ho-Kyoung | Technical University Berlin |
Seeger, Tim | Technical University Berlin |
Kreutz, Jonas | Technical University Berlin |
Martban, Ahmed | Technical University Berlin |
Shen, Zhengcheng | TU Berlin |
Zeng, Huajian | Technical University Munich |
Soh, Harold | National University of Singapore |
Keywords: Software Tools for Benchmarking and Reproducibility, Simulation and Animation, Human-Aware Motion Planning
Abstract: Building upon the foundations laid by our previous work, this paper introduces Arena 4.0, a significant advancement of Arena 3.0, Arena-Bench, Arena 1.0, and Arena 2.0. Arena 4.0 provides three main novel contributions: 1) a generative-model-based world and scenario generation approach using large language models (LLMs) and diffusion models to dynamically generate complex, human-centric environments from text prompts or 2D floorplans that can be used for development and benchmarking of social navigation strategies. 2) A comprehensive 3D model database which can be extended with 3D assets and semantically linked and annotated using a variety of metrics for dynamic spawning and arrangements inside 3D worlds. 3) The complete migration towards ROS 2, which ensures operation with state-of-the-art hardware and functionalities for improved navigation, usability, and simplified transfer towards real robots. We evaluated the platform's performance through a comprehensive user study and its world generation capabilities for benchmarking, demonstrating significant improvements in usability and efficiency compared to previous versions. Arena 4.0 is openly available at https://github.com/Arena-Rosnav.
|
|
15:30-15:35, Paper WeDT14.4 | |
Active Inference-Based Planning for Safe Human-Robot Interaction: Concurrent Consideration of Human Characteristic and Rationality |
|
Nam, Youngim | Ulsan National Institute of Science and Technology |
Kwon, Cheolhyeon | Ulsan National Institute of Science and Technology |
Keywords: Human-Aware Motion Planning, Safety in HRI, Planning under Uncertainty
Abstract: This paper proposes a motion planning strategy for a robot to safely interact with humans exhibiting uncertain actions. Human actions are often encoded by internal states attributed to human characteristics and rationality. First, by leveraging a continuous level of rationality, we compute the belief on human rationality along with his/her characteristic. This systematically reasons about the uncertainty in the observed human action, thereby better assessing the potential safety risks during the interaction. Second, based on the computed belief over the human internal states, we formulate a Stochastic Model Predictive Control (SMPC) problem to plan the robot’s actions such that it safely achieves its goal while also actively inferring the human internal state. To cope with the expensive computation of the SMPC, we develop a sampling-based technique that efficiently evaluates the robot’s actions conditioned on human uncertainty. The experimental results demonstrate that the proposed strategy excels in human action prediction and significantly improves the safety and efficiency of Human-Robot Interaction (HRI).
|
|
15:35-15:40, Paper WeDT14.5 | |
Characterizing the Complexity of Social Robot Navigation Scenarios |
|
Stratton, Andrew | University of Michigan |
Hauser, Kris | University of Illinois at Urbana-Champaign |
Mavrogiannis, Christoforos | University of Michigan |
Keywords: Human-Aware Motion Planning, Performance Evaluation and Benchmarking, Human-Centered Robotics
Abstract: Social robot navigation algorithms are often demonstrated in overly simplified scenarios, prohibiting the extraction of practical insights about their relevance to real-world domains. Our key insight is that an understanding of the inherent complexity of a social robot navigation scenario could help characterize the limitations of existing navigation algorithms and provide actionable directions for improvement. Through an exploration of recent literature, we identify a series of factors contributing to the complexity of a scenario, disambiguating between contextual and robot-related ones. We then conduct a simulation study investigating how manipulations of contextual factors impact the performance of a variety of navigation algorithms. We find that dense and narrow environments correlate most strongly with performance drops, while the heterogeneity of agent policies and directionality of interactions have a less pronounced effect. This motivates a shift towards developing and testing algorithms under higher-complexity settings.
|
|
15:40-15:45, Paper WeDT14.6 | |
Domain Randomization for Learning to Navigate in Human Environments (Resubmission) |
|
Ah Sen, Nick | Monash University |
Kulic, Dana | Monash University |
Carreno, Pamela | Monash University |
Keywords: Human-Aware Motion Planning, Reinforcement Learning
Abstract: In shared human-robot environments, effective navigation requires robots to adapt to various pedestrian behaviors encountered in the real world. Most existing deep reinforcement learning algorithms for human-aware robot navigation typically assume that pedestrians adhere to a single walking behavior during training, limiting their practicality/performance in scenarios where pedestrians exhibit various types of behavior. In this work, we propose to enhance the generalization capabilities of human-aware robot navigation by employing Domain Randomization (DR) techniques to train navigation policies on a diverse range of simulated pedestrian behaviors with the hope of better generalization to the real world. We evaluate the effectiveness of our method by comparing the generalization capabilities of a robot navigation policy trained with and without DR, both in simulations and through a real-user study, focusing on adaptability to different pedestrian behaviors, performance in novel environments, and users' perceived comfort, sociability and naturalness. Our findings reveal that the use of DR significantly enhances the robot's social compliance in both simulated and real-life contexts.
|
|
WeDT15 |
403 |
Manipulation Planning |
Regular Session |
Chair: Cheng, Xianyi | Duke University |
Co-Chair: Shirai, Yuki | Mitsubishi Electric Research Laboratories |
|
15:15-15:20, Paper WeDT15.1 | |
Characterizing Manipulation Robustness through Energy Margin and Caging Analysis |
|
Dong, Yifei | KTH |
Cheng, Xianyi | Carnegie Mellon University |
Pokorny, Florian T. | KTH Royal Institute of Technology |
Keywords: Manipulation Planning, Dexterous Manipulation, Grasping
Abstract: To develop robust manipulation policies, quantifying robustness is essential. Evaluating robustness in general dexterous manipulation, nonetheless, poses significant challenges due to complex hybrid dynamics, the combinatorial explosion of possible contact interactions, global geometry, etc. This paper introduces "caging in motion", an approach for analyzing manipulation robustness through energy margins and caging-based analysis. Our method assesses manipulation robustness by measuring the energy margin to failure and extends traditional caging concepts for a global analysis of dynamic manipulation. This global analysis is facilitated by a kinodynamic planning framework that naturally integrates global geometry, contact changes, and robot compliance. We validate the effectiveness of our approach in simulation and real-world experiments across multiple dynamic manipulation scenarios, highlighting its potential to predict manipulation success and robustness.
|
|
15:20-15:25, Paper WeDT15.2 | |
Enhancing Adaptivity of Two-Fingered Object Reorientation Using Tactile-Based Online Optimization of Deconstructed Actions |
|
Huang, Qiyin | Tsinghua University |
Li, Tiemin | Tsinghua University |
Jiang, Yao | Tsinghua University |
Keywords: Manipulation Planning, Grippers and Other End-Effectors, Perception for Grasping and Manipulation
Abstract: Object reorientation is a critical task for robotic grippers, especially when manipulating objects within constrained environments. The task poses significant challenges for motion planning due to high-dimensional output actions and complex input information, including unknown object properties and nonlinear contact forces. Traditional approaches simplify the problem by reducing degrees of freedom, limiting contact forms, or acquiring environment/object information in advance, significantly compromising adaptability. To address these challenges, we deconstruct the complex output actions into three fundamental types based on tactile sensing: task-oriented actions, constraint-oriented actions, and coordinating actions. These actions are then optimized online using gradient optimization to enhance adaptability. Key contributions include simplifying contact state perception, decomposing complex gripper actions, and enabling online action optimization for handling unknown objects or environmental constraints. Experimental results demonstrate that the proposed method is effective across a range of everyday objects, regardless of environmental contact. Additionally, the method exhibits robust performance even in the presence of unknown contacts and nonlinear external disturbances.
|
|
15:25-15:30, Paper WeDT15.3 | |
A Full-Cycle Assembly Operation: From Digital Planning to Trajectory Execution Using a Robotic Arm |
|
Livnat, Dror | Tel Aviv University |
Lavi, Yuval | Tel Aviv University |
Halperin, Dan | Tel Aviv University |
Keywords: Manipulation Planning, Motion and Path Planning
Abstract: We present an end-to-end framework for planning tight assembly operations, where the input is a set of digital models and the output is a full execution plan for a physical robotic arm, including the trajectory placement and the grasping. The framework builds on our earlier results on tight assembly planning for free-flying objects and includes the following novel components: (i) the framework itself together with physical demonstrations, (ii) trajectory placement based on a novel dynamic pathwise IK, and (iii) post-processing of the free-flying paths to relax the tightness and smooth the path. The framework provides guarantees as to the quality of the outcome trajectory. For each component we provide the algorithmic details and a full open-source software package for reproducing the process. Lastly, we demonstrate the framework with tight and challenging assembly problems (as well as puzzles, which are designed to be hard to assemble), using a UR5e robotic arm in the real world and in simulation. See the figure at the top for a physical UR5e assembling the alpha-z puzzle (known to be considerably more complicated to assemble than the celebrated alpha puzzle). Full video clips of all the assembly demonstrations together with our open-source software are available at our project page: https://tau-cgl.github.io/Full-Cycle-Assembly-Operation/
|
|
15:30-15:35, Paper WeDT15.4 | |
Robust Nonprehensile Dynamic Object Transportation: A Closed-Loop Sensitivity Approach |
|
Teimoorzadeh, Ainoor | Technical University of Munich |
Pupa, Andrea | University of Modena and Reggio Emilia |
Selvaggio, Mario | Università Degli Studi Di Napoli Federico II |
Haddadin, Sami | Mohamed Bin Zayed University of Artificial Intelligence |
Keywords: Manipulation Planning, Planning under Uncertainty, Motion and Path Planning
Abstract: In this paper, we propose a closed-loop sensitivity-based approach to enhance the robustness of robotic nonprehensile dynamic manipulation tasks. The proposed method aims at fulfilling the transportation of an object, free to move on a tray-shaped robot end-effector, in the face of imperfectly known nominal dynamic parameters. The approach is built on taking the parameterized reference trajectory to be tracked as the optimization variable, minimizing a norm of the system's closed-loop sensitivity. The resulting optimal reference trajectory is inherently more robust to parametric variations of the object's dynamic properties compared to a baseline straight-trajectory execution. The tracking performance is assessed and validated through hardware experiments and an extensive simulation campaign demonstrating the superior robustness of our approach.
|
|
15:35-15:40, Paper WeDT15.5 | |
Hierarchical Contact-Rich Trajectory Optimization for Multi-Modal Manipulation Using Tight Convex Relaxations |
|
Shirai, Yuki | Mitsubishi Electric Research Laboratories |
Raghunathan, Arvind | Mitsubishi Electric Research Laboratories |
Jha, Devesh | Mitsubishi Electric Research Laboratories |
Keywords: Manipulation Planning, Multi-Contact Whole-Body Motion Planning and Control, Optimization and Optimal Control
Abstract: Designing trajectories for manipulation through contact is challenging, as it requires reasoning about object and robot trajectories as well as complex contact sequences simultaneously. In this paper, we present a novel framework for simultaneously designing trajectories of robots, objects, and contacts efficiently for contact-rich manipulation. We propose a hierarchical optimization framework in which a Mixed-Integer Linear Program (MILP) selects optimal contacts between robot and object using approximate dynamical constraints, and a NonLinear Program (NLP) then optimizes the trajectories of the robot(s) and object considering full nonlinear constraints. We present a convex relaxation of bilinear constraints using a binary encoding technique such that the MILP can provide tighter solutions with better computational complexity. The proposed framework is evaluated on various manipulation tasks where it can reason about complex multi-contact interactions while providing computational advantages. We also demonstrate our framework in hardware experiments using a bimanual robot system.
|
|
15:40-15:45, Paper WeDT15.6 | |
Constraining Gaussian Process Implicit Surfaces for Robot Manipulation Via Dataset Refinement |
|
Kumar, Abhinav | University of Michigan |
Mitrano, Peter | University of Michigan |
Berenson, Dmitry | University of Michigan |
Keywords: Manipulation Planning, Motion and Path Planning
Abstract: Model-based control faces fundamental challenges in partially-observable environments due to unmodeled obstacles. We propose an online learning and optimization method to identify and avoid unobserved obstacles online. Our method, Constraint Obeying Gaussian Implicit Surfaces (COGIS), infers contact data using a combination of visual input and state tracking, informed by predictions from a nominal dynamics model. We then fit a Gaussian process implicit surface (GPIS) to these data and refine the dataset through a novel method of enforcing constraints on the estimated surface. This allows us to design a Model Predictive Control (MPC) method that leverages the obstacle estimate to complete multiple manipulation tasks. By modeling the environment instead of attempting to directly adapt the dynamics, our method succeeds at both low-dimensional peg-in-hole tasks and high-dimensional deformable object manipulation tasks. Our method succeeds in 10/10 trials vs 1/10 for a baseline on a real-world cable manipulation task under partial observability of the environment.
|
|
WeDT16 |
404 |
Optimization and Trajectory Planning |
Regular Session |
Chair: Figueroa, Nadia | University of Pennsylvania |
Co-Chair: Zinage, Vrushabh | University of Texas at Austin |
|
15:15-15:20, Paper WeDT16.1 | |
Optimizing Complex Control Systems with Differentiable Simulators: A Hybrid Approach to Reinforcement Learning and Trajectory Planning |
|
Parag, Amit | Sintef Ocean AS |
Mansard, Nicolas | CNRS |
Misimi, Ekrem | SINTEF Ocean |
Keywords: Optimization and Optimal Control, Reinforcement Learning, Machine Learning for Robot Control
Abstract: Deep reinforcement learning (RL) often relies on simulators as abstract oracles to model interactions within complex environments. While differentiable simulators have recently emerged for multi-body robotic systems, they remain underutilized, despite their potential to provide richer information. This underutilization, coupled with the high computational cost of exploration-exploitation in high-dimensional state spaces, limits the practical application of RL in the real world. We propose a method that integrates learning with differentiable simulators to enhance the efficiency of exploration-exploitation. Our approach learns value functions, state trajectories, and control policies from locally optimal runs of a model-based trajectory optimizer. The learned value function acts as a proxy to shorten the preview horizon, while approximated state and control policies guide the trajectory optimization. We benchmark our algorithm on three classical control problems and a torque-controlled 7 degree-of-freedom robot manipulator arm, demonstrating faster convergence and a more efficient symbiotic relationship between learning and simulation for end-to-end training of complex, poly-articulated systems.
|
|
15:20-15:25, Paper WeDT16.2 | |
TransformerMPC: Accelerating Model Predictive Control Via Transformers |
|
Zinage, Vrushabh | University of Texas at Austin |
Khalil, Ahmed | The University of Texas at Austin |
Bakolas, Efstathios | The University of Texas at Austin |
Keywords: Optimization and Optimal Control, AI-Based Methods, Autonomous Agents
Abstract: In this paper, we address the problem of reducing the computational burden of Model Predictive Control (MPC) for real-time robotic applications. We propose TransformerMPC, a method that enhances the computational efficiency of MPC algorithms by leveraging the attention mechanism in transformers for both online constraint removal and better warm start initialization. Specifically, TransformerMPC accelerates the computation of optimal control inputs by selecting only the active constraints to be included in the MPC problem, while simultaneously providing a warm start to the optimization process. This approach ensures that the original constraints are satisfied at optimality. TransformerMPC is designed to be seamlessly integrated with any solver, irrespective of its implementation. To guarantee constraint satisfaction after removing inactive constraints, we perform an offline verification to ensure that the optimal control inputs generated by the solver meet all constraints. The effectiveness of TransformerMPC is demonstrated through extensive numerical simulations on complex robotic systems, achieving up to 35x improvement in runtime without any loss in performance.
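To make the constraint-removal-and-verify idea concrete, the following sketch is not the authors' TransformerMPC pipeline: the transformer-based active-constraint predictor and warm start are abstracted into a boolean `active_mask` (a hypothetical input), and `cvxpy` is assumed as the QP solver. It solves a reduced quadratic program over the predicted-active rows and falls back to the full problem whenever verification against all constraints fails.

```python
import numpy as np
import cvxpy as cp

def reduced_then_verified_qp(Q, q, A, b, active_mask, tol=1e-8):
    """Solve min 0.5 u^T Q u + q^T u  s.t.  A u <= b, first over the rows
    predicted active by a learned model, then verify all constraints and
    re-solve the full problem if any are violated. Q is assumed PSD."""
    u = cp.Variable(Q.shape[0])

    def solve(rows):
        cons = [A[rows] @ u <= b[rows]] if len(rows) else []
        cp.Problem(cp.Minimize(0.5 * cp.quad_form(u, Q) + q @ u), cons).solve()
        return u.value

    u_reduced = solve(np.flatnonzero(active_mask))
    if np.all(A @ u_reduced <= b + tol):
        return u_reduced                     # reduced problem feasible for the full set
    return solve(np.arange(A.shape[0]))      # fall back to the full constraint set
```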
|
|
15:25-15:30, Paper WeDT16.3 | |
A New Semidefinite Relaxation for Linear and Piecewise Affine Optimal Control with Time Scaling |
|
Yang, Lujie | MIT |
Marcucci, Tobia | Massachusetts Institute of Technology |
Parrilo, Pablo | MIT |
Tedrake, Russ | Massachusetts Institute of Technology |
Keywords: Optimization and Optimal Control, Motion and Path Planning
Abstract: We introduce a semidefinite relaxation for optimal control of linear systems with time scaling. These problems are inherently nonconvex, since the system dynamics involves bilinear products between the discretization time step and the system state and controls. The proposed relaxation is closely related to the standard second-order semidefinite relaxation for quadratic constraints, but we carefully select a subset of the possible bilinear terms and apply a change of variables to achieve empirically tight relaxations while keeping the computational load light. We further extend our method to handle piecewise-affine (PWA) systems by formulating the PWA optimal-control problem as a shortest-path problem in a graph of convex sets (GCS). In this GCS, different paths represent different mode sequences for the PWA system, and the convex sets model the relaxed dynamics within each mode. By combining a tight convex relaxation of the GCS problem with our semidefinite relaxation with time scaling, we can solve PWA optimal-control problems through a single semidefinite program.
|
|
15:30-15:35, Paper WeDT16.4 | |
C-Uniform Trajectory Sampling for Fast Motion Planning |
|
Poyrazoglu, Oguzhan Goktug | University of Minnesota |
Cao, Yukang | University of Minnesota |
Isler, Volkan | University of Minnesota |
Keywords: Optimization and Optimal Control, Motion and Path Planning, Collision Avoidance
Abstract: We study the problem of sampling robot trajectories and introduce the notion of C-Uniformity. As opposed to the standard method of uniformly sampling control inputs (which leads to biased samples of the configuration space), C-Uniform trajectories are generated by control actions that lead to uniform sampling of the configuration space. After presenting an intuitive closed-form solution to generate C-Uniform trajectories for the 1D random walker, we present a network-flow-based optimization method to precompute C-Uniform trajectories for general robot systems. We apply the notion of C-Uniformity to the design of Model Predictive Path Integral (MPPI) controllers. Through simulation experiments, we show that using C-Uniform trajectories significantly improves the performance of MPPI-style controllers, achieving up to a 40% coverage performance gain compared to the best baseline. We demonstrate the practical applicability of our method with an implementation on a 1/10th-scale racer.
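As a toy illustration of the notion (not the paper's closed form or network-flow formulation), one can compute, for the 1D random walker, per-state move probabilities that keep the state distribution uniform over the states reachable at each step, in contrast to the binomial concentration produced by uniformly sampling the ±1 actions. The sketch below simply balances probability mass from the lowest state upward.

```python
import numpy as np

def c_uniform_up_probs(k):
    """Probabilities of moving +1 for states s = -k, -k+2, ..., k at step k,
    chosen so that the next-step distribution is uniform over its k+2
    reachable states (assuming the current distribution is already uniform)."""
    states = np.arange(-k, k + 1, 2)
    p_state = 1.0 / (k + 1)      # current mass on each reachable state
    target = 1.0 / (k + 2)       # desired mass on each next-step state
    up = np.zeros(len(states))
    carry = 0.0                  # mass already sent upward by the state below
    for i in range(len(states)):
        down_mass = target - carry       # what state s-1 still needs from s
        up_mass = p_state - down_mass    # the rest of s's mass goes upward
        up[i] = up_mass / p_state
        carry = up_mass
    return states, up

# e.g. at step k=3 the up-probabilities are [0.2, 0.4, 0.6, 0.8]; applying them
# recursively keeps every reachable state equally likely at every step.
```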
|
|
15:35-15:40, Paper WeDT16.5 | |
ADMM-MCBF-LCA: A Layered Control Architecture for Safe Real-Time Navigation |
|
Srikanthan, Anusha | University of Pennsylvania |
Xue, Yifan | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Matni, Nikolai | University of Pennsylvania |
Figueroa, Nadia | University of Pennsylvania |
Keywords: Optimization and Optimal Control, Integrated Planning and Control
Abstract: We consider the problem of safe real-time navigation of a robot in a dynamic environment with moving obstacles of arbitrary smooth geometries and input saturation constraints. We assume that the robot detects and models nearby obstacle boundaries with a short-range sensor and that this detection is error-free. This problem presents three main challenges: i) input constraints, ii) safety, and iii) real-time computation. To tackle all three challenges, we present a layered control architecture (LCA) consisting of an offline path library generation layer and an online path selection and safety layer. To overcome the limitations of reactive methods, our offline path library consists of feasible controllers, feedback gains, and reference trajectories. To handle the computational burden and safety, we solve online path selection and generate safe inputs at 100 Hz. Through simulations in Gazebo and on Fetch hardware in an indoor environment, we evaluate our approach against baselines that are layered, end-to-end, or reactive. Our experiments demonstrate that, among all algorithms, only our proposed LCA is able to safely complete tasks such as reaching a goal. When comparing metrics such as safety, input error, and success rate, we show that our approach generates safe and feasible inputs throughout the robot's execution.
|
|
15:40-15:45, Paper WeDT16.6 | |
Transformer-Based Model Predictive Control: Trajectory Optimization Via Sequence Modeling |
|
Celestini, Davide | Politecnico Di Torino |
Gammelli, Daniele | Stanford |
Guffanti, Tommaso | Stanford University |
D’Amico, Simone | Stanford University |
Capello, Elisa | Politecnico Di Torino CNR IEIIT |
Pavone, Marco | Stanford University |
Keywords: Optimization and Optimal Control, Deep Learning Methods, Machine Learning for Robot Control
Abstract: Model predictive control (MPC) has established itself as the primary methodology for constrained control, enabling general-purpose robot autonomy in diverse real-world scenarios. However, for most problems of interest, MPC relies on the recursive solution of highly non-convex trajectory optimization problems, leading to high computational complexity and a strong dependency on initialization. In this work, we present a unified framework to combine the main strengths of optimization-based and learning-based methods for MPC. Our approach entails embedding high-capacity, transformer-based neural network models within the optimization process for trajectory generation, whereby the transformer provides a near-optimal initial guess, or target plan, to a non-convex optimization problem. Our experiments, performed in simulation and in the real world on board a free-flyer platform, demonstrate the capabilities of our framework to improve MPC convergence and runtime. Compared to purely optimization-based approaches, results show that our approach can improve trajectory generation performance by up to 75%, reduce the number of solver iterations by up to 45%, and improve overall MPC runtime by 7x without loss in performance.
|
|
15:45-15:50, Paper WeDT16.7 | |
Experimental Validation of Sensitivity-Aware Trajectory Planning for a Redundant Robotic Manipulator under Payload Uncertainty |
|
Srour, Ali | CNRS |
Franchi, Antonio | University of Twente / Sapienza University of Rome |
Robuffo Giordano, Paolo | Irisa Cnrs Umr6074 |
Cognetti, Marco | LAAS-CNRS and Université Toulouse III - Paul Sabatier |
Keywords: Optimization and Optimal Control, Planning under Uncertainty, Manipulation Planning
Abstract: In this paper, we experimentally validate the recent concepts of closed-loop state and input sensitivity in the context of robust manipulation control for a robot manipulator. Our objective is to assess how optimizing trajectories with respect to sensitivity metrics can enhance the closed-loop system’s performance w.r.t. model uncertainties, such as those arising from payload variations during precise manipulation tasks. We conduct a series of experiments to validate our optimization approach across different trajectories, focusing primarily on evaluating the precision of the manipulator’s end-effector at critical moments where high accuracy is essential. Our findings offer valuable insights into improving the closed-loop robustness of the robot’s state and inputs against physical parametric uncertainties that could otherwise degrade the system performance.
|
|
WeDT17 |
405 |
Soft Robotics: Modeling, Control, and Learning |
Regular Session |
Chair: Zhang, Jianwei | University of Hamburg |
Co-Chair: Sun, Ye | University of Virginia |
|
15:15-15:20, Paper WeDT17.1 | |
Composite Learning Neural Network Tracking Control of Articulated Soft Robots |
|
Zou, Zhigang | Sun Yat-Sen University |
Li, Zhiwen | Sun Yat-Set University |
Li, Weibing | Sun Yat-Sen University |
Pan, Yongping | Peng Cheng Laboratory |
Keywords: Model Learning for Control, Compliant Joints and Mechanisms, Neural and Fuzzy Control
Abstract: Controlling articulated soft robots (ASRs) driven by variable stiffness actuators (VSAs) is challenging because they are highly nonlinear and difficult to model accurately. This paper proposes an efficient neural network (NN) learning control solution for ASRs driven by agonistic-antagonistic (AA) VSAs to guarantee tracking performance without exact robot models. Composite learning resorts to memory regressor extension to enhance adaptive parameter estimation such that parameter convergence can be guaranteed without the stringent condition of persistent excitation. In the proposed method, an NN-based controller is constructed for the position tracking of AA-VSA-driven ASRs, and an NN weight update law based on composite learning is developed to enhance online modeling and control capabilities. Experiments are carried out on an ASR with three degrees of freedom and qbmove Advance actuators (a kind of AA-VSA), which validate the effectiveness and superiority of the proposed method in terms of modeling and tracking accuracy compared with existing control methods.
|
|
15:20-15:25, Paper WeDT17.2 | |
Multi-Segment Soft Robot Control Via Deep Koopman-Based Model Predictive Control |
|
Lv, Lei | Tongji University |
Liu, Lei | Tsinghua University |
Bao, Lei | Beijing Soft Robot Tech Co., Ltd |
Sun, Fuchun | Tsinghua University |
Dong, Jiahong | Tsinghua University Affiliated Beijing Tsinghua Changgung Hospit |
Zhang, Jianwei | University of Hamburg |
Shan, Xuemei | Beijing Soft Robot Tech Co., Ltd |
Sun, Kai | Tsinghua University |
Huang, Hao | Beihang University |
Luo, Yu | Tsinghua University |
Keywords: Modeling, Control, and Learning for Soft Robots
Abstract: Soft robots, whose multiple segments of soft material bring flexibility and compliance, have advantages over rigid robots in safe interaction with the environment and dexterous operation. However, their high-dimensional, nonlinear, time-varying dynamics with infinite degrees of freedom make precise and dynamic control, such as trajectory tracking and position reaching, challenging. To address these challenges, we propose a Deep Koopman-based Model Predictive Control (DK-MPC) framework for handling multi-segment soft robots. We first employ a deep learning approach with sampled data to approximate the Koopman operator, which linearizes the high-dimensional nonlinear dynamics of the soft robot into a finite-dimensional linear representation. Second, this linearized model is utilized within a model predictive control framework to compute optimal control inputs that minimize the tracking error between the desired and actual state trajectories. Real-world experiments on the soft robot “Chordata” demonstrate that DK-MPC achieves high-precision control, showing the potential of DK-MPC for future applications to soft robots. More visualization results can be found at https://pinkmoon-io.github.io/DKMPC/.
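A minimal numpy sketch of the lifted-linear MPC idea described above (not the authors' DK-MPC implementation): it assumes a pre-trained lifting function `phi`, identified lifted-space matrices `A` and `B`, and a projection `C` back to the task space (all hypothetical names), and solves an unconstrained finite-horizon tracking problem in batch least-squares form, applying only the first input in receding-horizon fashion.

```python
import numpy as np

def koopman_mpc_step(phi, A, B, C, x, y_ref, horizon=20, r_weight=1e-2):
    """One receding-horizon step: lift the state, predict linearly, and apply
    the first input of the least-squares-optimal input sequence."""
    n, m = A.shape[0], B.shape[1]
    z0 = phi(x)                                            # lifted state
    # Batch prediction: z_k = A^k z0 + sum_{j<k} A^(k-1-j) B u_j for k = 1..horizon.
    F = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(horizon)])
    G = np.zeros((horizon * n, horizon * m))
    for k in range(horizon):
        for j in range(k + 1):
            G[k * n:(k + 1) * n, j * m:(j + 1) * m] = np.linalg.matrix_power(A, k - j) @ B
    Cs = np.kron(np.eye(horizon), C)                       # project every step to task space
    H, e = Cs @ G, np.tile(y_ref, horizon) - Cs @ F @ z0
    # Regularized least squares over the stacked input sequence.
    U = np.linalg.solve(H.T @ H + r_weight * np.eye(horizon * m), H.T @ e)
    return U[:m]                                           # apply only the first input
```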
|
|
15:25-15:30, Paper WeDT17.3 | |
Physics-Informed Split Koopman Operators for Data-Efficient Soft Robotic Simulation |
|
Ristich, Eron | Arizona State University |
Zhang, Lei | Arizona State University |
Ren, Yi | Arizona State University |
Sun, Jiefeng | Arizona State University |
Keywords: Modeling, Control, and Learning for Soft Robots, Model Learning for Control, Dynamics
Abstract: Koopman operator theory provides a powerful data-driven technique for modeling nonlinear dynamical systems in a linear framework, in comparison to computationally expensive and highly nonlinear physics-based simulations. However, Koopman operator-based models for soft robots are very high dimensional and require considerable amounts of data to properly resolve. Inspired by physics-informed techniques from machine learning, we present a novel physics-informed Koopman operator identification method that improves simulation accuracy for small dataset sizes. Through Strang splitting, the method takes advantage of both continuous and discrete Koopman operator approximation to obtain information both from trajectory and phase space data. The method is validated on a tendon-driven soft robotic arm, showing orders of magnitude improvement over standard methods in terms of the shape error. We envision this method can significantly reduce the data requirement of Koopman operators for systems with partially known physical models, and thus reduce the cost of obtaining data. More info: https://sunrobotics.lab.asu.edu/blog/2024/ristich-icra-2025/
|
|
15:30-15:35, Paper WeDT17.4 | |
Robust Swimming Controller for Soft Robots Via Drop-Out Learning |
|
Monica, Josephine | Cornell University |
Campbell, Mark | Cornell University |
Keywords: Soft Robot Applications, Reinforcement Learning, Robust/Adaptive Control
Abstract: A novel framework is developed for training a robotic fish to learn how to swim, even in the presence of degradations or failures in its actuators. Underwater robots, particularly soft fish-inspired designs, have gained significant attention due to their distinct benefits, including superior maneuverability, energy efficiency, versatile applications, and seamless integration with marine environments. However, their material properties and actuators can degrade, leading to premature system failures. In this paper, we introduce the concept of actuator drop-out during training to enable the robot to learn how to swim even when one or more actuators are degraded or non-functional. A Soft Actor-Critic deep reinforcement learning architecture is used to learn a policy, with actuator degradations/failures introduced during training. A four-actuator koi fish is modeled and simulated using the FishGym environment. Navigation-based validation tests show little degradation with one actuator failure, and much more robust swimming behaviors and performance compared to training with no failures, even when two or three actuators fail. These results will improve long-term operational reliability, ensuring robot fish functionality even in challenging underwater conditions.
|
|
15:35-15:40, Paper WeDT17.5 | |
Optimal Gait Control for a Tendon-Driven Soft Quadruped Robot by Model-Based Reinforcement Learning |
|
Niu, Xuezhi | Uppsala University |
Tan, Kaige | KTH Royal Institute of Technology |
Gurdur Broo, Didem | Uppsala University |
Feng, Lei | KTH Royal Institute of Technology |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Sensors and Actuators, Reinforcement Learning
Abstract: This study presents an innovative approach to optimal gait control for a soft quadruped robot enabled by four compressible tendon-driven soft actuators. Soft quadruped robots, compared to their rigid counterparts, are widely recognized for offering enhanced safety, lower weight, and simpler fabrication and control mechanisms. However, their highly deformable structure introduces nonlinear dynamics, making precise gait locomotion control complex. To solve this problem, we propose a novel model-based reinforcement learning (MBRL) method. The study employs a multi-stage approach, including state space restriction, data-driven surrogate model training, and MBRL development. Compared to benchmark methods, the proposed approach significantly improves the efficiency and performance of gait control policies. The developed policy is both robust and adaptable to the robot's deformable morphology. The study concludes by highlighting the practical applicability of these findings in real-world scenarios.
|
|
15:40-15:45, Paper WeDT17.6 | |
Physics-Guided Deep Learning Enabled Surrogate Modeling for Pneumatic Soft Robots |
|
Beaber, Sameh I. | University of Virginia |
Liu, Zhen | University of Virginia |
Sun, Ye | University of Virginia |
Keywords: Modeling, Control, and Learning for Soft Robots, Soft Robot Applications
Abstract: Soft robots, built from soft and compliant materials, have advanced significantly in recent years toward safe and adaptable operation and interaction with dynamic environments. Modeling the complex, nonlinear behaviors and controlling the deformable structures of soft robots present challenges. This study aims to establish a physics-guided deep learning (PGDL) computational framework that integrates physical models into a deep learning framework as surrogate models for soft robots. Once trained, these models can replace computationally expensive numerical simulations to shorten computation time and enable real-time control. This PGDL framework is among the first to integrate first-principles physics of soft robots into deep learning toward highly accurate yet computationally affordable models for soft robot modeling and control. The proposed framework has been implemented and validated using three different pneumatic soft fingers with different behaviors and geometries, along with two training and testing approaches, to demonstrate its effectiveness and generalizability. The results showed that the mean square error (MSE) of the predicted deformed curvature and of the maximum and minimum deformation at various loading conditions was as low as 10^-4 mm^2. The proposed PGDL framework is constructed from first-principles physics and is intrinsically applicable to various conditions by carefully considering the governing equations, auxiliary equations, and the corresponding boundary and initial conditions.
|
|
15:45-15:50, Paper WeDT17.7 | |
Learning-Based Nonlinear Model Predictive Control of Articulated Soft Robots Using Recurrent Neural Networks |
|
Schaefke, Hendrik | Leibniz University Hannover |
Habich, Tim-Lukas | Leibniz University Hannover |
Muhmann, Christian | Leibniz University Hannover |
Ehlers, Simon F. G. | Leibniz University Hannover |
Seel, Thomas | Leibniz Universität Hannover |
Schappler, Moritz | Institute of Mechatronic Systems, Leibniz Universitaet Hannover |
Keywords: Modeling, Control, and Learning for Soft Robots, Machine Learning for Robot Control, Optimization and Optimal Control
Abstract: Soft robots pose difficulties in terms of control, requiring novel strategies to effectively manipulate their compliant structures. Model-based approaches face challenges due to the high dimensionality and nonlinearities such as hysteresis effects. In contrast, learning-based approaches provide nonlinear models of different soft robots based only on measured data. In this paper, recurrent neural networks (RNNs) predict the behavior of an articulated soft robot (ASR) with five degrees of freedom (DoF). RNNs based on gated recurrent units (GRUs) are compared to the more commonly used long short-term memory (LSTM) networks and show better accuracy. The recurrence enables capturing hysteresis effects that are inherent in soft robots due to viscoelasticity or friction but cannot be captured by simple feedforward networks. The data-driven model is used within a nonlinear model predictive control (NMPC) scheme, with a focus on the correct handling of the RNN's hidden states. A training approach is presented that allows measured values to be utilized in each control cycle. This enables accurate predictions over short horizons based on sensor data, which is crucial for closed-loop NMPC. The proposed learning-based NMPC enables trajectory tracking with an average error of 1.2 deg in experiments with the pneumatic five-DoF ASR.
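For reference, a bare-bones GRU one-step dynamics model of the kind described above might look like the PyTorch sketch below; the layer sizes and residual output head are assumptions rather than the paper's architecture, and the returned hidden state is what an NMPC loop would carry across control cycles so that predictions start from the latest measurements.

```python
import torch
import torch.nn as nn

class GRUDynamics(nn.Module):
    """Predict the next robot state from (state, control) sequences."""
    def __init__(self, n_state=5, n_ctrl=5, n_hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_state + n_ctrl, n_hidden, batch_first=True)
        self.head = nn.Linear(n_hidden, n_state)

    def forward(self, states, controls, h0=None):
        x = torch.cat([states, controls], dim=-1)   # (batch, T, n_state + n_ctrl)
        out, h = self.gru(x, h0)
        return states + self.head(out), h           # residual next-state prediction + hidden state
```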
|
|
WeDT18 |
406 |
Surgical Robotics: Steerable Catheters/Needles 1 |
Regular Session |
Co-Chair: Chitalia, Yash | University of Louisville |
|
15:15-15:20, Paper WeDT18.1 | |
Towards a Tendon-Assisted Magnetically Steered (TAMS) Robotic Stylet for Brachytherapy |
|
Kheradmand, Pejman | University of Louisville |
Moradkhani, Behnam | University of Louisville |
Jella, Harshith | University of Louisville |
Sowards, Keith | Department of Radiation Oncology, University of Louisville |
Silva, Scott | University of Louisville |
Chitalia, Yash | University of Louisville |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Mechanism Design
Abstract: Interstitial brachytherapy requires up to 20 straight needles to surround and irradiate deep-seated tumors, but may offer sub-optimal radiation dosage in cases of advanced cancers. A steerable stylet can be used to guide the needle within the tissue, improving procedure accuracy and reducing the number of needles required for each operation. This work introduces the design of a novel tendon-assisted magnetically steered (TAMS) robotic stylet to steer commercially available brachytherapy needles. The dual-actuation modality (magnetic and tendon-driven) allows for increased bending compliance while retaining axial rigidity at extremely small diameters (OD: 1.4 mm), key properties for steering hollow needles from within their lumen. We also develop a two-tube Cosserat rod model that estimates the behavior of the TAMS robot and needle assembly under actuation from tendons, external magnetic fields, and finally combined magnet+tendon forces. We validate our model in free space and demonstrate the capability of the TAMS robot and dual-actuation modalities to steer brachytherapy needles to high curvatures inside phantom tissue.
|
|
15:20-15:25, Paper WeDT18.2 | |
VascularPilot3D: Toward a 3D Fully Autonomous Navigation for Endovascular Robotics |
|
Song, Jingwei | University of Michigan |
Yang, Keke | United Imaging |
Chen, Han | Shanghai United Imaging Medical High-Tech Research Institute Co |
Liu, Jiayi | United Imaging |
Gu, Yinan | Shanghai United Imaging Healthcare High Tech Research Institute |
Hui, Qianxin | Shanghai United Imaging Healthcare Advance Technology Research I |
Huang, Yanqi | Shanghai United Imaging Healthcare Co., LTD |
Li, Meng | Shanghai United Imaging Healthcare Co., Ltd |
Zhang, Zheng | 1. the Institute of Medical Imaging Technology, School of Biomed |
Cao, Tuoyu | United Imaging Healthcare |
Ghaffari, Maani | University of Michigan |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Vision-Based Navigation, Computer Vision for Medical Robotics
Abstract: This research reports VascularPilot3D, the first 3D fully autonomous endovascular robot navigation system. As an exploration toward autonomous guidewire navigation, VascularPilot3D is developed as a complete navigation system based on intra-operative imaging systems (fluoroscopic X-ray in this study) and typical endovascular robots. VascularPilot3D adopts previously researched fast 3D-2D vessel registration algorithms and guidewire segmentation methods as its perception modules. We additionally propose three modules: a topology-constrained 2D-3D instrument end-point lifting method, a tree-based fast path planning algorithm, and a prior-free endovascular navigation strategy. VascularPilot3D is compatible with most mainstream endovascular robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success rate across 25 trials. It reduces the human surgeon's overall control loops by 18.38%. VascularPilot3D is promising for general clinical autonomous endovascular navigation.
|
|
15:25-15:30, Paper WeDT18.3 | |
Weakly-Supervised Learning Via Multi-Lateral Decoder Branching for Tool Segmentation in Robot-Assisted Cardiovascular Catheterization |
|
Omisore, Olatunji Mumini | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Akinyemi, Toluwanimi | Shenzhen Institute of Advanced Technology |
Nguyen, Anh | University of Liverpool |
Wang, Lei | Shenzhen Institutes of Advanced Technology, Chinese Academy of S |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Object Detection, Segmentation and Categorization, Medical Robots and Systems
Abstract: Robot-assisted catheterization has garnered considerable attention for its potential in treating cardiovascular diseases. However, advancing surgeon-robot collaboration still requires further research, particularly on task-specific automation. For instance, automated tool segmentation can assist surgeons in visualizing and tracking endovascular tools during procedures. While learning-based models have demonstrated state-of-the-art segmentation performance, generating ground-truth labels for fully-supervised methods is labor-intensive, time-consuming, and costly. In this study, we developed a weakly-supervised learning method based on multi-lateral pseudo labeling for tool segmentation in cardiovascular angiogram datasets. The method utilizes a modified U-Net architecture featuring one encoder and multiple laterally branched decoders. The decoders generate diverse pseudo labels under different perturbations to augment the available partial annotation for model training. A mixed loss function with shared consistency was adapted for this purpose. The weakly-supervised model was trained end-to-end and validated using partially annotated angiogram data from three cardiovascular catheterization procedures. Validation results show that the weakly-supervised model performs close to fully-supervised models. Furthermore, the proposed multi-lateral approach outperforms three well-known weakly-supervised learning methods, offering the highest segmentation performance across the three angiogram datasets. Numerous ablation studies confirmed the model’s consistent performance under different settings. Finally, the model was applied to tool segmentation in robot-assisted catheterization experiments. The model enhanced visualization with high connectivity indices for the guidewire and catheter, and a mean segmentation time of 35.26 ms per frame. This study provides a fast, stable, and less expensive method for tool segmentation and visualization in robotic catheterization.
|
|
15:30-15:35, Paper WeDT18.4 | |
Image-Based Compliance Control for Robotic Steering of a Ferromagnetic Guidewire |
|
Hu, An | University of Toronto |
Sun, Chen | University of Toronto |
Dmytriw, Adam | Neurovascular Centre, Divisions of Therapeutic Neuroradiology An |
Xiao, Nan | Beijing Institute of Technology |
Sun, Yu | University of Toronto |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Visual Servoing, Compliance and Impedance Control
Abstract: Robotic steering of magnetic guidewires has shown great potential in accelerating endovascular interventions, enhancing the success rate of time-sensitive surgeries such as stroke treatment. Incomplete state feedback of the guidewire from 2D perspective images and unknown interactions with the surrounding vessel wall raise challenges in modeling and steering control. These two factors, however, are commonly overlooked by existing works. In this paper, 2D perspective images of the guidewire, which comply with prevalent medical imaging modalities, are used as the only feedback. A model-based external force observer is proposed that allows the guidewire to perceive the unknown interactions, and a compliance controller is subsequently designed to handle the external force while steering the guidewire. Experiments conducted in a human-sized phantom demonstrate how the compliance controller preserves stability and safety by adapting to the estimated external force.
|
|
15:35-15:40, Paper WeDT18.5 | |
Towards Evaluating the User Comfort and Experience of a Novel Steerable Drilling Robotic System in Pedicle Screw Fixation Procedures: A User Study |
|
Sharma, Susheela | University of Texas at Austin |
Racz, Frigyes Samuel | The University of Texas at Austin |
Go, Sarah | University of Texas at Austin |
Kapuria, Siddhartha | University of Texas at Austin |
Rezayof, Omid | University of Texas at Austin |
Amadio, Jordan P. | University of Texas Dell Medical School |
Khadem, Mohsen | University of Edinburgh |
Millán, José del R. | The University of Texas at Austin |
Alambeigi, Farshid | University of Texas at Austin |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Medical Robots and Systems, Human-Robot Collaboration
Abstract: Aiming at developing a safe, intuitive, and collaborative steerable drilling robotic system for pedicle screw fixation procedures, in this paper we leverage our recently developed steerable drilling robotic framework and develop a collaborative drilling mode to control this system. In this control mode, a user first positions a concentric tube steerable drilling robot (CT-SDR) in the workspace and aligns it based on a pre-planned trajectory. Next, the CT-SDR is directly controlled by the user through an admittance mode to perform a drilling procedure and create a J-shaped tunnel. To evaluate the user comfort and intuitiveness of the drilling procedure using this system and the proposed control interface, we performed a user study with 11 subjects who had no prior experience with the system. The results of this study were analyzed using various qualitative and quantitative metrics.
|
|
15:40-15:45, Paper WeDT18.6 | |
Minimally Invasive Endotracheal Inside-Out Flexible Needle Driving System towards Microendoscope-Guided Robotic Tracheostomy |
|
Lin, Botao | The Chinese University of Hong Kong |
Yuan, Sishen | The Chinese University of Hong Kong |
Zhang, Tinghua | The Chinese University of Hong Kong |
Zhang, Tao | Chinese University of Hong Kong |
Hao, Ruoyi | The Chinese University of Hong Kong |
Yuan, Wu | The Chinese University of Hong Kong |
Lim, Chwee Ming | National University of Singapore |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: Open tracheostomy (OT) is considered the traditional approach and the gold standard for treating patients with airway obstruction. However, OT has many unavoidable drawbacks, including strict performing scenarios, significant scarring, and the risk of surgeon infection. Percutaneous dilation tracheostomy (PDT) has emerged, with advantages including lower cost, smaller scarring, and better protection of surgeons from infection by aerosols. However, the outside-in puncture manner of PDT carries a risk of piercing the posterior tracheal wall and the esophagus with uncontrolled force. Additionally, locating tracheal rings and determining the puncture site externally can be challenging for certain patients, such as those who are obese or have undergone neck surgery, while this procedure typically relies on palpation and the surgeon's expertise. Hence, to improve the safety and simplicity of tracheostomy, a minimally invasive endotracheal inside-out flexible needle-driving system towards microendoscope-guided robotic tracheostomy (MERT) is proposed in this paper. Guided by an optical coherence tomography (OCT) probe and a microendoscope, the robot is inserted into the trachea and performs an inside-out puncture using a flexible needle. The robot can work through a standard endotracheal tube (ETT), and the puncture direction of the flexible needle is variable. Kinematics and statics models of the flexible needle have been derived, and the minimum position errors generated in the kinematics and statics validation experiments are 0.57 ± 0.21 mm and 0.27 ± 0.21 mm, respectively. Finally, a porcine trachea puncture experiment is carried out, and the feasibility of the proposed system is verified.
|
|
15:45-15:50, Paper WeDT18.7 | |
Comparison of Classical, Neural Network and Hybrid Models for Hysteretic Single-Tendon Catheter Kinematics |
|
Wang, Yuan | Boston Children's Hospital, Harvard Medical School |
Dupont, Pierre | Children's Hospital Boston, Harvard Medical School |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Kinematics, Deep Learning Methods
Abstract: While robotic control of catheter motion can improve tip positioning accuracy, hysteresis arising from tendon friction and flexural deformation degrades kinematic modeling accuracy. In this paper, we compare the capabilities of three types of models for representing the forward and inverse kinematic maps of a clinical single-tendon cardiac catheter. Classical hysteresis models, neural networks and hybrid combinations of the two are included. Our results show that modeling accuracy is best when models are trained using motions corresponding to the anticipated clinical motions. For sinusoidal motions, recurrent neural network models provide the best performance. For point-to-point motions, however, a simple backlash model can provide comparable performance to a recurrent neural network.
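As an illustration of the simplest classical hysteresis model mentioned above, the sketch below implements a standard backlash (play) operator; the dead-band half-width is a hypothetical parameter that would be fit to tendon-displacement and tip-angle data, and this is not the paper's calibrated catheter model.

```python
def backlash(inputs, half_width, y0=0.0):
    """Classical backlash (play) operator: the output stays put until the
    input moves more than `half_width` away from it, then follows with lag."""
    y, out = y0, []
    for x in inputs:
        if x - y > half_width:
            y = x - half_width
        elif y - x > half_width:
            y = x + half_width
        out.append(y)
    return out

# e.g. backlash([0, 1, 2, 1, 0], half_width=0.5) -> [0, 0.5, 1.5, 1.5, 0.5]
```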
|
|
WeDT19 |
407 |
Novel Methods for Mapping and Localization |
Regular Session |
Co-Chair: Kim, Ayoung | Seoul National University |
|
15:15-15:20, Paper WeDT19.1 | |
Fieldscale: Locality-Aware Field-Based Adaptive Rescaling for Thermal Infrared Image |
|
Gil, Hyeonjae | SNU |
Jeon, Myung-Hwan | UIUC |
Kim, Ayoung | Seoul National University |
Keywords: Computer Vision for Transportation, Recognition, Deep Learning for Visual Perception
Abstract: Thermal infrared (TIR) cameras are emerging as promising sensors in safety-related fields due to their robustness against external illumination. However, RAW TIR images have 14 bits of pixel depth and need to be rescaled to 8 bits for general applications. Previous works utilize a global 1D look-up table to compute the pixel-wise gain based solely on intensity, which degrades image quality by failing to consider the local nature of heat. We propose Fieldscale, a rescaling method based on locality-aware 2D fields in which both the intensity value and the spatial context of each pixel within an image are embedded. It can adaptively determine the pixel gain for each region and produce spatially consistent 8-bit rescaled images with minimal information loss and high visibility. Consistent performance improvements on image quality assessment and two other downstream tasks support the effectiveness and usability of Fieldscale. All code is publicly available to facilitate research advances in this field. https://github.com/hyeonjaegil/fieldscale
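For context, the global-rescaling baseline that the paper improves upon can be sketched as below; this is the naive approach, not Fieldscale, and the percentile clipping bounds are illustrative assumptions. Every pixel of the 14-bit frame receives the same gain regardless of local thermal context.

```python
import numpy as np

def global_minmax_rescale(raw14, low_pct=1.0, high_pct=99.0):
    """Map a 14-bit RAW TIR frame to 8 bits with one global gain/offset."""
    lo, hi = np.percentile(raw14, [low_pct, high_pct])   # robust global range
    scaled = (raw14.astype(np.float32) - lo) / max(hi - lo, 1e-6)
    return np.clip(scaled * 255.0, 0.0, 255.0).astype(np.uint8)
```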
|
|
15:20-15:25, Paper WeDT19.2 | |
Evaluating Global Geo-Alignment for Precision Learned Autonomous Vehicle Localization Using Aerial Data |
|
Yang, Yi | Nuro Inc |
Zhao, Xuran | Nuro |
Zhao, Haicheng Charles | Nuro |
Yuan, Shumin | Nuro AI |
Bateman, Samuel | Nuro |
Huang, Tiffany A. | Mercedes-Benz Research & Development North America |
Beall, Chris | Georgia Institute of Technology |
Maddern, Will | Nuro |
Keywords: Localization, Mapping, Intelligent Transportation Systems
Abstract: Recently, there has been growing interest in the use of aerial and satellite map data for autonomous vehicles, primarily due to its potential for significant cost reduction and enhanced scalability. Despite the advantages, aerial data also comes with challenges such as a sensor-modality gap and a viewpoint difference gap. Learned localization methods have shown promise for overcoming these challenges to provide precise metric localization for autonomous vehicles. Most learned localization methods rely on coarsely aligned ground truth, or implicit consistency-based methods, to learn the localization task; however, in this paper we find that improving the alignment between aerial data and autonomous vehicle sensor data at training time is critical to the performance of a learning-based localization system. We compare two data alignment methods using a factor graph framework and, using these methods, we then evaluate the effects of closely aligned ground truth on learned localization accuracy through ablation studies. Finally, we evaluate a learned localization system using the data alignment methods on a comprehensive (1,600 km) autonomous vehicle dataset and demonstrate localization error below 0.3 m and 0.5°, sufficient for autonomous vehicle applications.
|
|
15:25-15:30, Paper WeDT19.3 | |
Under Pressure: Altimeter-Aided ICP for 3D Maps Consistency |
|
Dubois, William | Université Laval |
Samson, Nicolas | Université Laval |
Daum, Effie | Université Laval |
Laconte, Johann | French National Research Institute for Agriculture, Food and The |
Pomerleau, Francois | Université Laval |
Keywords: Localization, Mapping, Field Robots
Abstract: We propose a novel method to enhance the accuracy of the Iterative Closest Point (ICP) algorithm by integrating altitude constraints from a barometric pressure sensor. While ICP is widely used in mobile robotics for Simultaneous Localization and Mapping (SLAM), it is susceptible to drift, especially in underconstrained environments such as vertical shafts. To address this issue, we propose to augment ICP with altimeter measurements, reliably constraining drifts along the gravity vector. To demonstrate the potential of altimetry in SLAM, we offer an analysis of calibration procedures and noise sensitivity of various pressure sensors, improving measurements to centimeter-level accuracy. Leveraging this accuracy, we propose a novel ICP formulation that integrates altitude measurements along the gravity vector, thus simplifying the optimization problem to 3 degrees of freedom (DOF). Experimental results from real-world deployments demonstrate that our method reduces vertical drift by 84% and improves overall localization accuracy compared to state-of-the-art methods in non-planar environments.
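As a reference point for the altitude constraint discussed above, barometric pressure is commonly converted to altitude with the standard-atmosphere formula; the sketch below uses the ISA constants (not the paper's sensor calibration), under which roughly 1 hPa of pressure change corresponds to about 8 m of altitude near sea level.

```python
import numpy as np

def pressure_to_altitude(p_pa, p0_pa=101325.0, t0_k=288.15):
    """International Standard Atmosphere pressure-to-altitude conversion."""
    L = 0.0065      # temperature lapse rate [K/m]
    R = 8.31446     # universal gas constant [J/(mol K)]
    g = 9.80665     # standard gravity [m/s^2]
    M = 0.0289644   # molar mass of dry air [kg/mol]
    return (t0_k / L) * (1.0 - (np.asarray(p_pa, dtype=float) / p0_pa) ** (R * L / (g * M)))
```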
|
|
15:30-15:35, Paper WeDT19.4 | |
Neural Ranging Inertial Odometry |
|
Wang, Si | Zhejiang University |
Shen, Bingqi | Zhejiang University |
Wang, Fei | Beijing Institute of Electronic System Engineering |
Cao, Yanjun | Zhejiang University, Huzhou Institute of Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Localization, Range Sensing, Deep Learning Methods
Abstract: Ultra-wideband (UWB) has shown promising potential in GPS-denied localization thanks to its lightweight and drift-free characteristics, while the accuracy is limited in real scenarios due to its sensitivity to sensor arrangement and non-Gaussian pattern induced by multi-path or multi-signal interference, which commonly occurs in many typical applications like long tunnels. We introduce a novel neural fusion framework for ranging inertial odometry which involves a graph attention UWB network and a recurrent neural inertial network. Our graph net learns scene-relevant ranging patterns and adapts to any number of anchors or tags, realizing accurate positioning without calibration. Additionally, the integration of least squares and the incorporation of nominal frame enhance overall performance and scalability. The effectiveness and robustness of our methods are validated through extensive experiments on both public and self-collected datasets, spanning indoor, outdoor, and tunnel environments. The results demonstrate the superiority of our proposed IR-ULSG in handling challenging conditions, including scenarios outside the convex envelope and cases where only a single anchor is available.
|
|
15:35-15:40, Paper WeDT19.5 | |
Robust Preintegrated Wheel Odometry for Off-Road Autonomous Ground Vehicles |
|
Potokar, Easton | Carnegie Mellon Uiversity |
McGann, Daniel | Carnegie Mellon University |
Kaess, Michael | Carnegie Mellon University |
Keywords: Localization, Wheeled Robots, Field Robots
Abstract: Wheel odometry is not often used in state estimation for off-road vehicles due to frequent wheel slippage, varying wheel radii, and the 3D motion of the vehicle not fitting with the 2D nature of integrated wheel odometry. This paper attempts to overcome these issues by proposing a novel 3D preintegration of wheel encoder measurements on manifold. Our method additionally estimates wheel slip, radii, and baseline online to improve accuracy and robustness. Further, due to the preintegration, many measurements can be summarized into a single motion constraint using first-order updates for wheel slippage and intrinsics, allowing for efficient usage in an optimization-based state estimation framework. While our method can be used with any sensors in a factor graph framework, we validate its effectiveness and observability of parameters in a vision-wheel-odometry system (VWO) in a Monte Carlo simulation. Additionally, we illustrate its accuracy and demonstrate it can be used to overcome other sensor failures in real-world off-road scenarios in both a VWO and visual-inertial-wheel odometry (VIWO) system.
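To contrast with the paper's 3D on-manifold preintegration with online slip and radius estimation, plain 2D differential-drive dead reckoning from encoder ticks looks like the sketch below (parameter names are illustrative); this is the formulation whose planar assumption and fixed wheel radii break down off-road.

```python
import numpy as np

def integrate_wheel_odometry(ticks_l, ticks_r, r_l, r_r, baseline, ticks_per_rev):
    """Integrate left/right encoder tick counts into 2D poses (x, y, theta)."""
    x, y, th = 0.0, 0.0, 0.0
    poses = [(x, y, th)]
    for dtl, dtr in zip(np.diff(ticks_l), np.diff(ticks_r)):
        dl = 2.0 * np.pi * r_l * dtl / ticks_per_rev    # left wheel arc length
        dr = 2.0 * np.pi * r_r * dtr / ticks_per_rev    # right wheel arc length
        ds, dth = 0.5 * (dl + dr), (dr - dl) / baseline
        x += ds * np.cos(th + 0.5 * dth)                # midpoint integration
        y += ds * np.sin(th + 0.5 * dth)
        th += dth
        poses.append((x, y, th))
    return poses
```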
|
|
15:40-15:45, Paper WeDT19.6 | |
Air-Ground Collaboration with SPOMP: Semantic Panoramic Online Mapping and Planning (I) |
|
Miller, Ian | Burro |
Cladera, Fernando | University of Pennsylvania |
Smith, Trey | NASA Ames Research Center |
Taylor, Camillo Jose | University of Pennsylvania |
Kumar, Vijay | University of Pennsylvania |
Keywords: Field Robots, Multi-Robot Systems, Aerial Systems: Perception and Autonomy
Abstract: Mapping and navigation have gone hand-in-hand since long before robots existed. Maps are a key form of communication, allowing someone who has never been somewhere to nonetheless navigate that area successfully. In the context of multirobot systems, the maps and information that flow between robots are necessary for effective collaboration, whether those robots are operating concurrently, sequentially, or completely asynchronously. In this article, we argue that maps must go beyond encoding purely geometric or visual information to enable increasingly complex autonomy, particularly between robots. We propose a framework for multirobot autonomy, focusing in particular on air and ground robots operating in outdoor 2.5-D environments. We show that semantic maps can enable the specification, planning, and execution of complex collaborative missions, including localization in Global Positioning System (GPS)-denied settings. A distinguishing characteristic of this work is that we strongly emphasize field experiments and testing, and by doing so demonstrate that these ideas can work at scale in the real world. We also perform extensive simulation experiments to validate our ideas at even larger scales. We believe that these experiments and the experimental results constitute a significant step forward toward advancing the state of the art of large-scale, collaborative multirobot systems operating with real communication, navigation, and perception constraints.
|
|
15:45-15:50, Paper WeDT19.7 | |
Visual-Inertial Localization Leveraging Skylight Polarization Pattern Constraints |
|
Wan, Zhenhua | Guangxi University |
Fu, Peng | Tsinghua University |
Wang, Kunfeng | Tsinghua University |
Zhao, Kaichun | Tsinghua University |
Keywords: Localization, Visual-Inertial SLAM, Sensor Fusion
Abstract: In this letter, we develop a tightly coupled polarization-visual-inertial localization system that utilizes naturally occurring polarized skylight to provide a global heading. We introduce a focal plane polarization camera with negligible instantaneous field-of-view error to collect polarized skylight. Then, we design a robust heading determination method from polarized skylight and construct a stable global heading constraint. In particular, this constraint compensates for the heading unobservability present in standard VINS. In addition to the standard sparse visual feature measurements used in VINS, polarization heading residuals are constructed and co-optimized in a tightly-coupled VINS update. An adaptive fusion strategy is designed to correct the cumulative drift. Outdoor real-world experiments show that the proposed method outperforms state-of-the-art VINS-Fusion in terms of localization accuracy, and improves accuracy by 22% over VINS-Fusion in a wooded campus environment.
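A minimal sketch of the kind of global heading constraint described here: a whitened, angle-wrapped residual between the estimator's yaw and the polarization-derived heading, which could be appended to the visual-inertial residuals. The function name and noise value are assumptions; the paper's tightly coupled formulation and adaptive fusion strategy are not reproduced.

```python
import numpy as np

def heading_residual(yaw_est, yaw_pol, sigma_pol=np.deg2rad(1.0)):
    """Whitened residual between estimated yaw and the polarization heading."""
    # Wrap the angle difference to (-pi, pi] before whitening.
    err = np.arctan2(np.sin(yaw_est - yaw_pol), np.cos(yaw_est - yaw_pol))
    return err / sigma_pol
```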
|
|
WeDT20 |
408 |
Human-Robot Interaction: Physiological Sensing |
Regular Session |
Co-Chair: Lagomarsino, Marta | Istituto Italiano Di Tecnologia |
|
15:15-15:20, Paper WeDT20.1 | |
Promoting Trust in Industrial Human-Robot Collaboration through Preference-Based Optimization |
|
Campagna, Giulio | Aalborg University |
Lagomarsino, Marta | Istituto Italiano Di Tecnologia |
Lorenzini, Marta | Istituto Italiano Di Tecnologia |
Chrysostomou, Dimitrios | Aalborg University |
Rehm, Matthias | Aalborg University |
Ajoudani, Arash | Istituto Italiano Di Tecnologia |
Keywords: Human Factors and Human-in-the-Loop, Acceptability and Trust, Human-Robot Collaboration
Abstract: This paper proposes a novel theoretical framework for promoting trust in human-robot collaboration (HRC). The framework exploits Preference-Based Optimization (PBO) and focuses on three key interaction parameters: robot velocity profile, human-robot separation distance, and vertical proximity to the user’s head. By iteratively refining these parameters based on qualitative feedback from human collaborators, the system dynamically adapts robot trajectories. This personalization aims to enhance users’ confidence in the robot’s actions and foster a more trusting collaborative environment. In our user study with fourteen participants, we simulated a chemical industrial scenario for the HRC task. Results suggest that the framework effectively promotes human operator confidence in the robot assistant, particularly for individuals with limited prior experience in robotics.
|
|
15:20-15:25, Paper WeDT20.2 | |
GazeHTA: End-To-End Gaze Target Detection with Head-Target Association |
|
Lin, Zhi-Yi | Delft University of Technology |
Chew, Jouh Yeong | Honda Research Institute Japan |
van Gemert, Jan C. | TU Delft |
Zhang, Xucong | Delft University of Technology |
Keywords: Intention Recognition, Gesture, Posture and Facial Expressions, Deep Learning for Visual Perception
Abstract: Precisely detecting which object a person is paying attention to is critical for human-robot interaction since it provides important cues for the next action from the human user. We propose an end-to-end approach for gaze target detection: predicting a head-target connection between individuals and the target image regions they are looking at. Most of the existing methods use independent components such as off-the-shelf head detectors or have problems in establishing associations between heads and gaze targets. In contrast, we investigate an end-to-end multi-person Gaze target detection framework with Heads and Targets Association (GazeHTA), which predicts multiple head-target instances based solely on input scene image. GazeHTA addresses challenges in gaze target detection by (1) leveraging a pre-trained diffusion model to extract scene features for rich semantic understanding, (2) re-injecting a head feature to enhance the head priors for improved head understanding, and (3) learning a connection map as the explicit visual associations between heads and gaze targets. Our extensive experimental results demonstrate that GazeHTA outperforms state-of-the-art gaze target detection methods and two adapted diffusion-based baselines on two standard datasets.
|
|
15:25-15:30, Paper WeDT20.3 | |
Gaze and Go: Harnessing Visual Attention Valence in Upper-Limb Robotic Rehabilitation with Tailored Gamification and Eye Tracking for Neuroplasticity |
|
Wang, Daomiao | Fudan University |
He, Peidong | University of Shanghai for Science and Technology |
Wang, Yixi | Shanghai ZD MedTech Co., Ltd |
Jian, Zhuo | Shanghai ZD MEDTECH |
Song, Zilong | Fudan University |
Hu, Qihan | Fudan University |
Fang, Fanfu | Changhai Hospital |
Yang, Cuiwei | Fudan University |
Wang, Daoyu | Fudan University |
Yu, Hongliu | University of Shanghai for Science and Technology |
Keywords: Human Detection and Tracking, Rehabilitation Robotics, Human-Robot Collaboration
Abstract: Therapeutic robotic systems have emerged as reliable tools for physical rehabilitation, providing variable-intensity movement assistance to patients with motor impairments. Robot-assisted rehabilitation facilitates the restoration of mobility and dexterity, promotes functional neuroplasticity, and potentially enables workforce reentry through training-induced cognitive and motor learning. To boost participant engagement and visuomotor coordination, we propose ArmGuider Pro, an advanced upper-limb training system that integrates hand-eye collaboration and gaze-triggered assistance within rehabilitation-tailored serious games. The system implements intuitive eye-tracking and visual-triggering strategies to align therapeutic interventions with participants' intentional focus, incorporating immersive gaming elements and adaptive control algorithms. Experimental validation demonstrates significant activation in motor and cognitive cerebral cortex regions, enhanced visual attention concentration in desired target areas (25.92% improvement), and improved trajectory adherence across sequential sessions (27.27% improvement). By harnessing visual attention valence, our proposed system could encourage neuroplasticity, supporting its viability for clinical application and widespread adoption in rehabilitation regimens.
|
|
15:30-15:35, Paper WeDT20.4 | |
Teleoperating a 6 DoF Robotic Manipulator from Head Movements |
|
Poignant, Alexis | Sorbonne Université, ISIR UMR 7222 CNRS |
Jarrassé, Nathanael | Sorbonne Université, ISIR UMR 7222 CNRS |
Morel, Guillaume | Sorbonne Université, CNRS, INSERM |
Keywords: Telerobotics and Teleoperation, Human Detection and Tracking
Abstract: This article presents an interactive control approach allowing a human user to teleoperate a robotic manipulator located nearby. With this approach, the user keeps his/her hands free, as only head movements are exploited to control the robot. The controller maps the 6 Degrees of Freedom (DoF) user's head position and orientation into the 6 DoF robot end-effector position and orientation. The robot can reach a large workspace thanks to the combination of two features. Firstly, a virtual wand between the user's head and the robot end-effector converts the user's head pan-tilt rotations into large displacements of the robot end-effector center perpendicularly to the wand axis (2 DoF). Secondly, for the remaining 4 DoF (robot end-effector center displacement along the wand axis and robot end-effector orientation), real-time deformation of the virtual wand is triggered when the user reaches uncomfortable configurations due to his/her head workspace limitations. Additionally, the user gets, through an Augmented Reality (AR) Headset, non-delayed visual feedback of the current virtual wand geometry and location. The paper includes a description of the setup and the proposed controller, detailing how the robot position/orientation is coupled to the user's head position/orientation. A set of elementary experiments with a constant-geometry wand is first presented, showing workspace limitations for some DoF. Then the wand reconfiguration is introduced in the experiments, leading to full control of 6 DoF manipulation tasks throughout a large workspace.
|
|
15:35-15:40, Paper WeDT20.5 | |
Wearable Soft Sensing Band with Stretchable Sensors for Torque Estimation and Hand Gesture Recognition |
|
Choi, Junhwan | Korea Advanced Institute of Science and Technology, (KAIST) |
Feng, Jirou | Korea Advanced Institute of Science and Technology |
Kim, Jung | KAIST |
Keywords: Human Detection and Tracking, Intention Recognition, Wearable Robotics
Abstract: This paper presents a wearable soft sensing band with stretchable sensors for monitoring muscle activity by estimating muscle volume changes. Unlike conventional surface electromyography (sEMG) sensing techniques, which require excessive pressure or adhesive electrodes, the proposed sensing method allows muscle volume variations to be detected simply by placing the device on the skin without additional pressure or adhesives. The band was evaluated in isometric-static and isometric-varying torque estimation tasks, demonstrating superior accuracy to sEMG, with a relative torque to maximum torque estimation error of less than 11.5%. In isometric-varying conditions, relative torque was estimated with an average error of 10.1% at frequencies of 0.1 Hz, 0.2 Hz, and 0.5 Hz. Furthermore, the band achieved a classification accuracy of 92.9% in recognizing ten distinct hand gestures, highlighting its capability to differentiate between multiple muscle activations. The lightweight and flexible design addresses limitations of sEMG, such as signal noise, skin irritation, and complex calibration. Experimental results validate the potential of the proposed sensing method for applications in muscle activity monitoring across healthcare, rehabilitation, and sports, and it also offers potential for use in robot teaching for reference motion generation.
|
|
15:40-15:45, Paper WeDT20.6 | |
Plug-And-Play Multi-Domain Fusion Adaptation for Cross-Subject EEG-Based Motor Imagery Classification |
|
Shi, Kecheng | School of Automation Engineering, University of Electronic Science and Technology of China |
Huang, Rui | University of Electronic Science and Technology of China |
Li, Zhe | University of Electronic Science and Technology of China |
Lyu, Jianzhi | University of Hamburg |
Zhao, Yang | University of Electronic Science and Technology of China |
Song, Guangkui | University of Electronic Science and Technology of China |
Cheng, Hong | University of Electronic Science and Technology |
Zhang, Jianwei | University of Hamburg |
Keywords: Brain-Machine Interfaces, Intention Recognition
Abstract: Motor imagery (MI) classification in rehabilitation brain-computer interfaces (RBCIs) faces significant challenges due to the variability of electroencephalography (EEG) signals across subjects. Existing methods typically require extensive EEG data collection from each new subject, which is time-consuming and results in poor user experience. To address this issue, this paper decomposes MI EEG into subject-specific private components and shared components common across all subjects, and proposes a plug-and-play domain fusion adaptive method (PPMDFA) to handle variability between subjects. In the training phase, PPMDFA introduces a Multi-Domain Fusion Graph Convolutional Network (MDFGCN) module to extract shared and private features from the MI processes of source domain subjects. In the calibration phase, the method constructs private classifiers for the new target subject using the extracted shared features combined with a small amount of labeled data. During testing, PPMDFA leverages the similarity of private components to utilize knowledge from source subjects, thereby enhancing classification accuracy for target subjects' MI. We validated the proposed method on the PhysioNet and LLMBCImotion datasets. Experimental results show that PPMDFA achieves state-of-the-art classification accuracy on both datasets, with rapid adaptation to new subjects using only 20% of the data, reaching accuracies of 73.33% and 61.62%, demonstrating strong generalization ability and robustness.
|
|
15:45-15:50, Paper WeDT20.7 | |
Learning to Communicate Functional States with Nonverbal Expressions for Improved Human-Robot Collaboration |
|
Roy, Liam | Monash University |
Croft, Elizabeth | University of Victoria |
Kulic, Dana | Monash University |
Keywords: Human-Robot Collaboration, Multi-Modal Perception for HRI, Social HRI
Abstract: Collaborative robots must effectively communicate their internal state to humans to enable a smooth interaction. Nonverbal communication is widely used to communicate information during human-robot interaction, however, such methods may also be misunderstood, leading to communication errors. In this work, we explore modulating the acoustic parameter values (pitch bend, beats per minute, beats per loop) of nonverbal auditory expressions to convey functional robot states (accomplished, progressing, stuck). We propose a reinforcement learning (RL) algorithm based on noisy human feedback to produce accurately interpreted nonverbal auditory expressions. The proposed approach was evaluated through a user study with 24 participants. The results demonstrate that: (i) Our proposed RL-based approach is able to learn suitable acoustic parameter values which improve the users’ ability to correctly identify the state of the robot. (ii) Algorithm initialization informed by previous user data can be used to significantly speed up the learning process. (iii) The method used for algorithm initialization strongly influences whether participants converge to similar sounds for each robot state. (iv) Modulation of pitch bend has the largest influence on user association between sounds and robotic states.
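The learning problem can be viewed, in simplified form, as a search over discrete acoustic parameter settings under noisy binary feedback. The epsilon-greedy bandit below is only a sketch of that view, not the authors' RL algorithm or its informed initialization; `candidates` would be tuples of (pitch bend, BPM, beats per loop), and `get_feedback` is a hypothetical callback returning 1 when a participant correctly identifies the intended robot state.

```python
import random

def epsilon_greedy_sound_search(candidates, get_feedback, rounds=100, eps=0.2):
    """Search for the acoustic parameter setting most often interpreted correctly."""
    counts = {c: 0 for c in candidates}
    wins = {c: 0 for c in candidates}
    for _ in range(rounds):
        if random.random() < eps:
            c = random.choice(candidates)                 # explore
        else:                                             # exploit best empirical rate
            c = max(candidates,
                    key=lambda k: wins[k] / counts[k] if counts[k] else 0.0)
        counts[c] += 1
        wins[c] += get_feedback(c)   # 1 if the participant identified the state
    return max(candidates, key=lambda k: wins[k] / max(counts[k], 1))
```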
|
|
WeDT21 |
410 |
Vision-Language-Action Models |
Regular Session |
|
15:15-15:20, Paper WeDT21.1 | |
SpatialBot: Precise Spatial Understanding with Vision Language Models |
|
Cai, Wenxiao | Stanford University |
Ponomarenko, Iaroslav | Peking University |
Yuan, Jianhao | University of Oxford |
Li, Xiaoqi | Peking University |
Yang, Wankou | Southeast University |
Dong, Hao | Peking University |
Zhao, Bo | Shanghai Jiao Tong University |
Keywords: RGB-D Perception, Deep Learning in Grasping and Manipulation, AI-Based Methods
Abstract: Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is fundamental to embodied AI. In this paper, we propose SpatialBot, a model designed to enhance spatial understanding by utilizing both RGB and depth images. To train VLMs for depth perception, we introduce the SpatialQA and SpatialQA-E datasets, which include multi-level depth-related questions spanning various scenarios and embodiment tasks. SpatialBench is also developed to comprehensively evaluate VLMs' spatial understanding capabilities across different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and embodied AI tasks demonstrate the remarkable improvements offered by SpatialBot. The model, code, and datasets are available at https://github.com/BAAI-DCAI/SpatialBot.
|
|
15:20-15:25, Paper WeDT21.2 | |
Run-Time Observation Interventions Make Vision-Language-Action Models More Visually Robust |
|
Hancock, Asher | Princeton University |
Ren, Allen Z. | Princeton University |
Majumdar, Anirudha | Princeton University |
Keywords: Deep Learning Methods
Abstract: Vision-language-action (VLA) models trained on large-scale internet data and robot demonstrations have the potential to serve as generalist robot policies. However, despite their large-scale training, VLAs are often brittle to task-irrelevant visual details such as distractor objects or background colors. We introduce Bring Your Own VLA (BYOVLA): a run-time intervention scheme that (1) dynamically identifies regions of the input image that the model is sensitive to, and (2) minimally alters task-irrelevant regions to reduce the model’s sensitivity using automated image editing tools. Our approach is compatible with any off-the-shelf VLA without model finetuning or access to the model’s weights. Hardware experiments on language-instructed manipulation tasks demonstrate that BYOVLA enables state-of-the-art VLA models to nearly retain their nominal performance in the presence of distractor objects and backgrounds, which otherwise degrade task success rates by up to 60%.
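A minimal sketch of the "identify sensitive regions" half of such a run-time intervention: occlude each image patch and measure how much the policy output moves. BYOVLA's actual sensitivity probe and the subsequent generative editing of task-irrelevant regions are not shown; `policy_fn` is a hypothetical black-box VLA returning an action vector as a NumPy array.

```python
import numpy as np

def sensitivity_map(image, policy_fn, patch=32):
    """Occlusion-style sensitivity of a policy to each image patch."""
    base = policy_fn(image)
    h, w = image.shape[:2]
    heat = np.zeros((h // patch, w // patch))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            masked = image.copy()
            ys = slice(i * patch, (i + 1) * patch)
            xs = slice(j * patch, (j + 1) * patch)
            masked[ys, xs] = image[ys, xs].mean(axis=(0, 1))   # flatten the patch
            heat[i, j] = np.linalg.norm(policy_fn(masked) - base)
    return heat   # high values = regions the model is sensitive to
```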
|
|
15:25-15:30, Paper WeDT21.3 | |
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data |
|
Tang, Grace | University of California, Berkeley |
Rajkumar, Swetha | University of California, Berkeley |
Zhou, Yifei | University of California, Berkeley |
Walke, Homer | UC Berkeley |
Levine, Sergey | UC Berkeley |
Fang, Kuan | Cornell University |
Keywords: Deep Learning Methods, Big Data in Robotics and Automation, Deep Learning in Grasping and Manipulation
Abstract: Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points. Compared to baselines using pre-trained VLMs, our approach consistently achieves superior performance.
|
|
15:30-15:35, Paper WeDT21.4 | |
GHIL-Glue: Hierarchical Control with Filtered Subgoal Images |
|
Hatch, Kyle Beltran | Toyota Research Institute |
Balakrishna, Ashwin | Toyota Research Institute |
Mees, Oier | University of California, Berkeley |
Nair, Suraj | Stanford University |
Park, Seohong | Seohong@berkeley.edu |
Wulfe, Blake | Stanford University |
Itkina, Masha | Stanford University |
Eysenbach, Benjamin | CMU |
Levine, Sergey | UC Berkeley |
Kollar, Thomas | Toyota Research Institute |
Burchfiel, Benjamin | Toyota Research Institute |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Imitation Learning
Abstract: Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photo-realistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively “glue together” language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.
|
|
15:35-15:40, Paper WeDT21.5 | |
Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video |
|
Chavis, Zachary | University of Minnesota |
Park, Hyun Soo | Carnegie Mellon University |
Guy, Stephen J. | University of Minnesota - Twin Cities |
Keywords: Vision-Based Navigation, Deep Learning for Visual Perception, Motion and Path Planning
Abstract: Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models are limited to reasoning over objects and actions currently visible on the image plane. We present a spatial extension to the VLM, which leverages spatially-localized egocentric video demonstrations to augment VLMs in two ways --- through understanding spatial task-affordances, i.e. where an agent must be for the task to physically take place, and the localization of that task relative to the egocentric viewer. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting representation will enable robots to use egocentric sensing to navigate to, or around, physical regions of interest for novel tasks specified in natural language.
|
|
15:40-15:45, Paper WeDT21.6 | |
QUART-Online: Latency-Free Multimodal Large Language Model for Quadruped Robot Learning |
|
Tong, Xinyang | Westlake University |
Ding, Pengxiang | Westlake University |
Fan, Yiguo | Westlake University |
Wang, Donglin | Westlake University |
Zhang, Wenjie | Westlake University |
Cui, Can | Westlake University |
Sun, Mingyang | Westlake University |
Zhao, Han | Westlake University |
Zhang, Hongyin | Westlake University |
Dang, Yonghao | Beijing University of Posts and Telecommunications |
Huang, Siteng | Westlake University |
Lyu, Shangke | Westlake University |
Keywords: Perception-Action Coupling, Vision-Based Navigation, Imitation Learning
Abstract: This paper addresses the inherent inference latency challenges associated with deploying multimodal large language models (MLLM) in quadruped vision-language-action (QUAR-VLA) tasks. Our investigation reveals that conventional parameter reduction techniques ultimately impair the performance of the language foundation model during the action instruction tuning phase, making them unsuitable for this purpose. We introduce a novel latency-free quadruped MLLM model, dubbed QUART-Online, designed to enhance inference efficiency without degrading the performance of the language foundation model. By incorporating Action Chunk Discretization (ACD), we compress the original action representation space, mapping continuous action values onto a smaller set of discrete representative vectors while preserving critical information. Subsequently, we fine-tune the MLLM to integrate vision, language, and compressed actions into a unified semantic space. Experimental results demonstrate that QUART-Online operates in tandem with the existing MLLM system, achieving real-time inference at 50Hz in sync with the underlying controller frequency, significantly boosting the success rate across various tasks by 65%. Our project page is https://quart-online.github.io.
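A hedged sketch of what chunk-level action discretization could look like: cluster flattened action chunks into a small codebook, then encode and decode chunks as discrete tokens for the language model. The clustering choice (k-means), codebook size, and function names are assumptions for illustration, not the paper's exact ACD procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_action_codebook(action_chunks, num_codes=256):
    """Cluster (num_chunks, chunk_len, action_dim) chunks into representative vectors."""
    flat = action_chunks.reshape(len(action_chunks), -1)
    return KMeans(n_clusters=num_codes, n_init=10, random_state=0).fit(flat)

def encode(km, chunk):
    """Map a continuous action chunk to its discrete token."""
    return int(km.predict(chunk.reshape(1, -1))[0])

def decode(km, token, chunk_len, action_dim):
    """Recover the representative continuous chunk for a token."""
    return km.cluster_centers_[token].reshape(chunk_len, action_dim)
```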
|
|
15:45-15:50, Paper WeDT21.7 | |
IntelliRMS: A Robotic Manipulation System for Domain-Specific Tasks Using Vision and Language Foundational Models |
|
Singh, Chandan Kumar | Tata Consultancy Services |
Kumar, Devesh | Tata Consultancy Services Limited |
Sanap, Vipul | TCS |
Khandelwal, Mayank | Tata Consultancy Services Limited |
Sinha, Rajesh | TCS-Noida |
Keywords: Software Architecture for Robotic and Automation, Software-Hardware Integration for Robot Systems, AI-Enabled Robotics
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced machines’ ability to understand and follow human instructions. In many tasks, LLMs have demonstrated performance that rivals human-level common sense. However, directly applying LLMs to domain-specific use cases, such as robotic pick-and-place, remains a challenge. Tasks that are intuitive for humans, who rely on prior knowledge and skills, become complex for robots. Industrial robotic applications like pick-and-place require a high degree of accuracy, often exceeding 90%. In response to these challenges in domain-specific applications, we propose IntelliRMS, a novel system-oriented architecture for instruction-following robotic manipulation. The IntelliRMS synergizes the linguistic and open-vocabulary visual capabilities of foundational models to arrive at an accurate, robust and scalable system. Further, we demonstrate the effectiveness of IntelliRMS in a real-world industrial Bin-picking scenario within the retail sector, validating its performance with a comprehensive dataset.
|
|
WeDT22 |
411 |
Deep Learning for Visual Perception 2 |
Regular Session |
Chair: Ding, Mingyu | University of North Carolina at Chapel Hill |
|
15:15-15:20, Paper WeDT22.1 | |
SCA3D: Enhancing Cross-Modal 3D Retrieval Via 3D Shape and Caption Paired Data Augmentation |
|
Ren, Junlong | The Hong Kong University of Science and Technology (Guangzhou) |
Wu, Hao | HKUST |
Xiong, Hui | HKUST(GZ) |
Wang, Hao | HKUST(GZ) |
Keywords: Deep Learning for Visual Perception, Visual Learning, Recognition
Abstract: The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover’s Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in https://github.com/3DAgentWorld/SCA3D.
|
|
15:20-15:25, Paper WeDT22.2 | |
TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object Detection |
|
Jacobson, Philip | University of California, Berkeley |
Xie, Yichen | University of California, Berkeley |
Ding, Mingyu | UC Berkeley |
Xu, Chenfeng | University of California, Berkeley |
Tomizuka, Masayoshi | University of California |
Zhan, Wei | University of California, Berkeley |
Wu, Ming | University of California, Berkeley |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, AI-Based Methods
Abstract: Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training. In this work, we address the problem of improving pseudo-label quality through leveraging long-term temporal information captured in driving scenes. More specifically, we leverage pre-trained motion-forecasting models to generate object trajectories on pseudo-labeled data to further enhance the student model training. Our approach improves pseudo-label quality in two distinct manners: first, we suppress false positive pseudo-labels through establishing consistency across multiple frames of motion forecasting outputs. Second, we compensate for false negative detections by directly inserting predicted object tracks into the pseudo-labeled scene. Experiments on the nuScenes dataset demonstrate the effectiveness of our approach, improving the performance of standard semi-supervised approaches in a variety of settings.
|
|
15:25-15:30, Paper WeDT22.3 | |
Single-Shot Metric Depth from Focused Plenoptic Cameras |
|
Lasheras-Hernandez, Blanca | German Aerospace Center (DLR) |
Strobl, Klaus H. | German Aerospace Center (DLR) |
Izquierdo, Sergio | University of Zaragoza |
Bodenmueller, Tim | German Aerospace Center (DLR) |
Triebel, Rudolph | German Aerospace Center (DLR) |
Civera, Javier | Universidad De Zaragoza |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision
Abstract: Metric depth estimation from visual sensors is crucial for robots to perceive, navigate, and interact with their environment. Traditional range imaging setups, such as stereo or structured light cameras, face hassles including calibration, occlusions, and hardware demands, with accuracy limited by the baseline between cameras. Single- and multiview monocular depth offers a more compact alternative, but is constrained by the unobservability of the metric scale. Light field imaging provides a promising solution for estimating metric depth by using a unique lens configuration through a single device. However, its application to single-view dense metric depth is under-addressed mainly due to the technology’s high cost, the lack of public benchmarks, and proprietary geometrical models and software. Our work explores the potential of focused plenoptic cameras for dense metric depth. We propose a novel pipeline that predicts metric depth from a single plenoptic camera shot by first generating a sparse metric point cloud using a neural network, which is then used to scale and align a dense relative depth map regressed by a foundation depth model, resulting in a dense metric depth. To validate it, we curated the Light Field & Stereo Image Dataset (LFS) of real-world light field images with stereo depth labels, filling a current gap in existing resources. Experimental results show that our pipeline produces accurate metric depth predictions, laying a solid groundwork for future research in this field.
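The "scale and align" step lends itself to a compact illustration: fit a scale and shift by least squares so the relative depth map agrees with the sparse metric points, then apply the fit densely. This is a simplified linear version under assumed names; the actual pipeline may fit robustly or operate in inverse-depth space.

```python
import numpy as np

def align_relative_depth(rel_depth, sparse_uv, sparse_z):
    """Fit s, b so that s * rel_depth + b matches sparse metric measurements.

    rel_depth : (H, W) relative depth from a foundation model.
    sparse_uv : (N, 2) integer pixel coordinates (u, v) of the metric points.
    sparse_z  : (N,) metric depths at those pixels.
    """
    d = rel_depth[sparse_uv[:, 1], sparse_uv[:, 0]]      # sampled relative depths
    A = np.stack([d, np.ones_like(d)], axis=1)           # least-squares design matrix
    (s, b), *_ = np.linalg.lstsq(A, sparse_z, rcond=None)
    return s * rel_depth + b                             # dense metric depth
```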
|
|
15:30-15:35, Paper WeDT22.4 | |
TREND: Tri-Teaching for Robust Preference-Based Reinforcement Learning with Demonstrations |
|
Huang, Shuaiyi | University of Maryland, College Park |
Levy, Mara | University of Maryland, College Park |
Gupta, Anubhav | University of Maryland, College Park |
Ekpo, Daniel | University of Maryland, College Park |
Zheng, Ruijie | University of Maryland, College Park |
Shrivastava, Abhinav | University of Maryland, College Park |
Keywords: Deep Learning for Visual Perception, Deep Learning Methods
Abstract: Preference feedback collected by human or VLM annotators is often noisy, presenting a significant challenge for preference-based reinforcement learning that relies on accurate preference labels. To address this challenge, we propose TREND, a novel framework that integrates few-shot expert demonstrations with a tri-teaching strategy for effective noise mitigation. Our method trains three reward models simultaneously, where each model views its small-loss preference pairs as useful knowledge and teaches such useful pairs to its peer network for updating the parameters. Remarkably, our approach requires as few as one to three expert demonstrations to achieve high performance. We evaluate TREND on various robotic manipulation tasks, achieving up to 90% success rates even with noise levels as high as 40%, highlighting its effective robustness in handling noisy preference feedback.
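The small-loss selection at the heart of tri-teaching can be sketched in a few lines: each reward model keeps the fraction of preference pairs with the lowest loss and hands them to a peer model for its update. The fixed keep-ratio below is an assumption; co-teaching-style methods typically anneal it over training.

```python
import numpy as np

def small_loss_selection(losses, noise_rate=0.4):
    """Indices of the preference pairs a reward model trusts (smallest losses).

    losses     : (N,) per-pair losses under one of the three reward models.
    noise_rate : assumed fraction of noisy labels; keep the remaining 1 - noise_rate.
    """
    keep = int(round(len(losses) * (1.0 - noise_rate)))
    return np.argsort(losses)[:keep]   # pass these pairs to a peer model's update
```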
|
|
15:35-15:40, Paper WeDT22.5 | |
SYNERGUARD: A Robust Framework for Point Cloud Classification Via Local Geometry and Spatial Topology |
|
Zhong, Haonan | The University of New South Wales |
Song, Wei | UNSW |
Pagnucco, Maurice | University of New South Wales |
Song, Yang | University of New South Wales |
Keywords: Deep Learning for Visual Perception, Recognition, Acceptability and Trust
Abstract: Point cloud recognition models are known to be vulnerable to adversarial attacks. The state-of-the-art defense solutions either focus on partial features of the point cloud, limiting their effectiveness, or rely heavily on known adversarial examples, reducing their generalizability, while others, like point cloud reconstruction, will degrade the classifier’s accuracy on clean examples. To address this, we introduce SYNERGUARD, a novel robust point cloud classification framework mitigating adversarial attacks by considering comprehensive geometric and topological attributes of the point cloud, without relying on known adversarial examples, while maintaining classification accuracy on clean examples. We comprehensively test SYNERGUARD against seven attack types from three leading adversarial attack approaches on two widely used datasets, ModelNet40 and ShapeNetPart. The results demonstrate SYNERGUARD’s superiority over existing defenses in mitigating adversarial attacks as well as in handling clean examples.
|
|
15:40-15:45, Paper WeDT22.6 | |
Is Discretization Fusion All You Need for Collaborative Perception? |
|
Yang, Kang | Renmin University of China |
Bu, Tianci | National University of Defense and Technology |
Li, Lantao | Sony (China) Limited |
Li, Chunxu | School of Information, Renmin University of China |
Wang, Yongcai | Renmin University of China |
Li, Deying | Renmin University of China |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Intelligent Transportation Systems
Abstract: Collaborative perception in multi-agent systems enhances overall perceptual capabilities by facilitating the exchange of complementary information among agents. Current mainstream collaborative perception methods rely on discretized feature maps to conduct fusion, which, however, lacks flexibility in extracting and transmitting informative features and can hardly focus on those features during fusion. To address these problems, this paper proposes a novel Anchor-Centric paradigm for Collaborative Object detection (ACCO). It avoids grid precision issues and allows more flexible and efficient anchor-centric communication and fusion. ACCO is composed of three main components: (1) an Anchor Featuring Block (AFB) that generates anchor proposals and projects prepared anchor queries onto image features; (2) an Anchor Confidence Generator (ACG) designed to minimize communication by transmitting only the features of confident anchors; and (3) a local-global fusion module, in which local fusion is anchor alignment-based fusion (LAAF) and global fusion is conducted by spatial-aware cross-attention (SACA). LAAF and SACA run over multiple layers, so agents conduct anchor-centric fusion iteratively to adjust the anchor proposals. Comprehensive experiments evaluate ACCO on the OPV2V and DAIR-V2X datasets and demonstrate ACCO's superiority in reducing communication volume and in improving perception range and detection performance. We will make our code available.
|
|
15:45-15:50, Paper WeDT22.7 | |
Tri-AutoAug: Single Domain Generalization for Bird's-Eye-View 3D Object Detection through Pixel-2D-3D Features |
|
Zhao, Xue | SJTU |
Peng, Pai | Cowarobot |
Li, Xianfei | Cowarobot |
Wang, Xinbing | Shanghai Jiao Tong University |
Zhou, Chenghu | Shanghai Jiao Tong University |
Ye, Nanyang | Shanghai Jiao Tong University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Object Detection, Segmentation and Categorization
Abstract: With the increasing popularity of autonomous driving based on the Bird’s-Eye-View (BEV) representation, improving the generalization of such detection models is key for safe real-world applications. However, a realistic yet challenging scenario, Single Domain Generalization (SDG) for BEV, is still under-explored. A key ingredient for SDG is to first increase data diversity via common image augmentation or adversarial data generation. However, common image-level augmentation is not sufficient to ensure domain diversity in most parts of the latent space, and adversarial generation suffers from unstable training and mode collapse. To address these limitations, we present Tri-level Automatic Augmentation (Tri-AutoAug), a simple yet effective method to enlarge the diversity and quantity of data from image and 2D features and to facilitate learning more domain-invariant features in BEV space. Besides, Tri-AutoAug can automatically learn augmentation strategies to avoid spending too much time manually adjusting hyperparameters and to maximize the benefit of Tri-level Augmentation. To the best of our knowledge, this is the first study to explore automatic augmentation for SDG BEV. Extensive experiments on NuScenes-C including eight testing domains have demonstrated that our approach achieves the best performance across various domain generalization methods. More importantly, we evaluate the proposed method in real-world autonomous driving scenarios. Tri-AutoAug improves out-of-distribution (OOD) performance by 8.54% (mAP), which demonstrates that Tri-AutoAug provides a practical and feasible solution for the application of 3D detectors in the real world. The code is available at https://github.com/ClaireTun/Tri-AutoAug.
|
|
WeDT23 |
412 |
Learning Based Planning and Control |
Regular Session |
Chair: Faigl, Jan | Czech Technical University in Prague |
Co-Chair: Zarrouk, David | Ben Gurion University |
|
15:15-15:20, Paper WeDT23.1 | |
Motion Planning for Minimally-Actuated Serial Robots |
|
Cohen, Avi | Ben Gurion University of the Negev |
Sintov, Avishai | Tel-Aviv University |
Zarrouk, David | Ben Gurion University |
Keywords: Integrated Planning and Learning, Redundant Robots, Kinematics
Abstract: Modern manipulators are acclaimed for their precision but often struggle to operate in confined spaces. This limitation has driven the development of hyper-redundant and continuum robots. While these present unique advantages, they face challenges in, for instance, weight, mechanical complexity, modeling and costs. The Minimally Actuated Serial Robot (MASR) has been proposed as a light-weight, low-cost and simpler alternative where passive joints are actuated with a Mobile Actuator (MA) moving along the arm. Yet, Inverse Kinematics (IK) and a general motion planning algorithm for the MASR have not been addressed. In this letter, we propose the MASR-RRT* motion planning algorithm specifically developed for the unique kinematics of the MASR. The main component of the algorithm is a data-based model for solving the IK problem while considering minimal traverse of the MA. The model is trained solely using the forward kinematics of the MASR and does not require real data. With the model as a local-connection mechanism, MASR-RRT* minimizes a cost function expressing the action time. In a comprehensive analysis, we show that MASR-RRT* is superior in performance to the straightforward implementation of the standard RRT*. Experiments on a real robot in different environments with obstacles validate the proposed algorithm.
|
|
15:20-15:25, Paper WeDT23.2 | |
Using Implicit Behavior Cloning and Dynamic Movement Primitive to Facilitate Reinforcement Learning for Robot Motion Planning |
|
Zhang, Zengjie | Eindhoven University of Technology |
Hong, Jayden | Uvic ACIS Lab |
Soufi Enayati, Amir Mehdi | University of Victoria |
Najjaran, Homayoun | University of Victoria |
Keywords: Efficient Reinforcement Learning, Motion and Path Planning, Learning and Adaptive Systems, Learning from Demonstration
Abstract: Reinforcement learning (RL) for motion planning of multi-degree-of-freedom robots still suffers from low efficiency in terms of slow training speed and poor generalizability. In this paper, we propose a novel RL-based robot motion planning framework that uses implicit behavior cloning (IBC) and dynamic movement primitive (DMP) to improve the training speed and generalizability of an off-policy RL agent. IBC utilizes human demonstration data to leverage the training speed of RL, and DMP serves as a heuristic model that transfers motion planning into a simpler planning space. To support this, we also create a human demonstration dataset using a pick-and-place experiment that can be used for similar studies. Comparison studies reveal the advantage of the proposed method over the conventional RL agents with faster training speed and higher scores. A real-robot experiment indicates the applicability of the proposed method to a simple assembly task. Our work provides a novel perspective on using motion primitives and human demonstration to leverage the performance of RL for robot applications.
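For readers unfamiliar with the DMP component, the sketch below is a textbook single-DoF discrete DMP (Ijspeert-style), fit to one demonstration and rolled out at an arbitrary duration. It covers only the heuristic-model piece, not the IBC or off-policy RL parts of the framework, and the gains and basis layout are common defaults rather than the paper's settings.

```python
import numpy as np

class DMP1D:
    """Minimal single-DoF discrete Dynamic Movement Primitive."""

    def __init__(self, n_basis=20, alpha=25.0, beta=6.25, alpha_x=3.0):
        self.alpha, self.beta, self.alpha_x = alpha, beta, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))   # basis centres
        self.h = 1.0 / (np.diff(self.c, append=self.c[-1] * 0.5) ** 2)
        self.w = np.zeros(n_basis)

    def fit(self, y, dt):
        """Learn the forcing-term weights from one demonstration y sampled at dt."""
        tau = (len(y) - 1) * dt
        self.y0, self.g = y[0], y[-1]
        yd = np.gradient(y, dt)
        ydd = np.gradient(yd, dt)
        x = np.exp(-self.alpha_x * np.arange(len(y)) * dt / tau)     # canonical system
        f_target = tau**2 * ydd - self.alpha * (self.beta * (self.g - y) - tau * yd)
        psi = np.exp(-self.h * (x[:, None] - self.c) ** 2)           # (T, n_basis)
        xi = x * (self.g - self.y0)
        self.w = (psi * xi[:, None] * f_target[:, None]).sum(0) / \
                 ((psi * xi[:, None] ** 2).sum(0) + 1e-10)           # local regression

    def rollout(self, tau, dt):
        """Reproduce the motion, possibly with a different duration tau."""
        y, v, x, out = self.y0, 0.0, 1.0, []
        for _ in range(int(tau / dt)):
            psi = np.exp(-self.h * (x - self.c) ** 2)
            f = (psi @ self.w) / (psi.sum() + 1e-10) * x * (self.g - self.y0)
            vd = (self.alpha * (self.beta * (self.g - y) - v) + f) / tau
            y, v = y + v * dt / tau, v + vd * dt
            x += -self.alpha_x * x * dt / tau
            out.append(y)
        return np.array(out)
```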
|
|
15:25-15:30, Paper WeDT23.3 | |
Interpretable Active Inference Gait Control Learning |
|
Szadkowski, Rudolf | Czech Technical University in Prague |
Faigl, Jan | Czech Technical University in Prague |
Keywords: Bioinspired Robot Learning, Probabilistic Inference, Learning from Experience
Abstract: Sustaining gait locomotion in an adversarial environment requires the robot to react adaptively to novel experiences. In the Free Energy Principle (FEP), the behavioral reaction is driven by the discrepancy between observation and prediction. Although predicting gait dynamics for legged robots is challenging, since the consequences of actions depend non-linearly on the activity history, animal gaits are robust, adapting to severe motion disruptions seemingly instantly. In biomimetic robotics, the Central Pattern Generator (CPG) relaxes the general dynamics of body-environment interaction to the stable and repetitive dynamics of gait. Based on these observations, we propose self-learning of the gait dynamics model and an FEP framework that infers state estimates and gait control. The proposed method is experimentally evaluated on a real hexapod walking robot with 18 controllable degrees of freedom. The robot learns the gait dynamics model indoors and then deploys it in outdoor navigation under various adversarial scenarios. Results show that the developed interpretable gait controller exhibits complex, real-time adaptive behavior when it encounters unknown situations.
|
|
15:30-15:35, Paper WeDT23.4 | |
DOPT: D-Learning with Off-Policy Target Toward Sample Efficiency and Fast Convergence Control |
|
Shen, Zhaolong | Beihang University |
Quan, Quan | Beihang University |
Keywords: Machine Learning for Robot Control, Deep Learning Methods, Learning Categories and Concepts
Abstract: In recent times, Lyapunov theory has been incorporated into learning-based control methods to provide a stability guarantee. However, merely satisfying the Lyapunov conditions does not fully leverage the capabilities of the Neural Network (NN) controller. Furthermore, training an effective Lyapunov candidate requires substantial data, which inherently results in sample inefficiency. To address these limitations, we propose an off-policy variant of the vanilla D-learning method that uses current and historical data to iteratively enhance the NN controller within the framework of Lyapunov theory. Our method outperforms the Deep Deterministic Policy Gradient (DDPG) and D-learning in terms of stability, sample efficiency, and the quality of the trained controllers and Lyapunov candidates.
|
|
15:35-15:40, Paper WeDT23.5 | |
DFM: Deep Fourier Mimic for Expressive Dance Motion Learning |
|
Watanabe, Ryo | SONY Group |
Li, Chenhao | ETH Zurich |
Hutter, Marco | ETH Zurich |
Keywords: Learning from Demonstration, Reinforcement Learning, Art and Entertainment Robotics
Abstract: As entertainment robots gain popularity, the demand for natural and expressive motion, particularly in dancing, continues to rise. Traditionally, dancing motions have been manually designed by artists, a process that is both labor-intensive and restricted to simple motion playback, lacking the flexibility to incorporate additional tasks such as locomotion or gaze control during dancing. To overcome these challenges, we introduce Deep Fourier Mimic (DFM), a novel method that combines advanced motion representation with Reinforcement Learning (RL) to enable smooth transitions between motions while concurrently managing auxiliary tasks during dance sequences. While previous frequency-domain motion representations have successfully encoded dance motions into latent parameters, they often impose overly rigid periodic assumptions at the local level, resulting in reduced tracking accuracy and motion expressiveness, which is a critical aspect for entertainment robots. By relaxing these locally periodic constraints, our approach not only enhances tracking precision but also facilitates smooth transitions between different motions. Furthermore, the learned RL policy that supports simultaneous base activities, such as locomotion and gaze control, allows entertainment robots to engage more dynamically and interactively with users rather than merely replaying static, pre-designed dance routines.
|
|
15:40-15:45, Paper WeDT23.6 | |
Uncertainty-Aware Deep Reinforcement Learning with Calibrated Quantile Regression and Evidential Learning |
|
Stutts, Alex Christopher | University of Illinois Chicago |
Erricolo, Danilo | University of Illinois at Chicago |
Tulabandhula, Theja | University of Illinois Chicago |
Mittal, Mohit | Meta Reality Labs |
Trivedi, Amit Ranjan | University of Illinois at Chicago (UIC), Chicago, USA |
Keywords: Deep Learning Methods, Reinforcement Learning, Planning under Uncertainty
Abstract: We present a novel statistical approach to incorporate uncertainty awareness in model-free distributional deep reinforcement learning for mission- and safety-critical robotics. Deep learning predictions are influenced by uncertainties in the data, termed aleatoric uncertainties, as well as uncertainties in the learning process and model structure, known as epistemic uncertainties. The proposed algorithm, called Calibrated Evidential Quantile Regression in Deep-Q Networks (CEQR-DQN), addresses key challenges associated with separately estimating aleatoric and epistemic uncertainty in stochastic robotic environments. It combines deep evidential learning with quantile calibration based on the principles of conformal inference to provide explicit, sample-free computations of global uncertainty as opposed to local estimates based on simple variance. Thereby, the proposed approach overcomes limitations of traditional methods in computational and statistical efficiency and handling of out-of-distribution (OOD) observations. Tested on a suite of representative miniaturized Atari games (i.e., MinAtar), CEQR-DQN is shown to surpass similar existing frameworks in scores and learning speed. Its ability to rigorously evaluate uncertainties improves exploration strategies and can serve as a blueprint for other uncertainty-aware robotic algorithms.
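As a reference point for the quantile-calibration ingredient, the split-conformal correction of an upper-quantile predictor is sketched below: compute nonconformity scores on a held-out calibration set and shift future quantile predictions by their finite-sample-adjusted quantile. CEQR-DQN combines such calibration with evidential learning inside distributional RL; this snippet covers only the conformal step, with illustrative names.

```python
import numpy as np

def calibrate_quantile(pred_quantiles, y_calib, alpha=0.1):
    """Additive correction for a (1 - alpha) upper-quantile predictor.

    pred_quantiles : (N,) predicted quantiles on calibration inputs.
    y_calib        : (N,) observed outcomes on the same inputs.
    """
    scores = y_calib - pred_quantiles                    # nonconformity scores
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n         # finite-sample adjustment
    return np.quantile(scores, min(q_level, 1.0), method="higher")
```

Adding the returned correction to future quantile predictions yields approximately valid (1 - alpha) coverage under the usual exchangeability assumption.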
|
|
15:45-15:50, Paper WeDT23.7 | |
Teaching Periodic Stable Robot Motions Generation Via Sketch |
|
Zhi, Weiming | Carnegie Mellon University |
Tang, Haozhan | Carnegie Mellon University |
Zhang, Tianyi | Carnegie Mellon University |
Johnson-Roberson, Matthew | Carnegie Mellon University |
Keywords: Machine Learning for Robot Control, Learning from Demonstration
Abstract: Contemporary robots are complex systems. Teaching novel motion patterns to robots requires specialised expertise, often entailing the careful specification of robot motion or the cumbersome design of optimisation problems. In this paper, we seek to simplify the process of generating periodic motions, by teaching robots with user sketches. In particular, we tackle the problem of teaching a robot to approach a surface and then follow cyclic motion on the surface. The limit cycle of the motion can be arbitrarily specified by a single user-provided sketch over an image from the robot’s camera, and the sketched limit cycle is then projected into the scene. To generate motion that converges to the limit cycle, we contribute the Stable Periodic Diagrammatic Teaching (SPDT) framework. SPDT models the robot’s motion as an Orbitally Asymptotically Stable (O.A.S.) dynamical system that learns to stabilise based on the diagrammatic sketch provided by the user. This is achieved by applying a differentiable and invertible function, known as a diffeomorphism, to shape a known O.A.S. system. The parameterised diffeomorphism is then optimised with respect to the Hausdorff distance between the limit cycle of our modelled system and the sketch, to produce the desired robot motion. We provide insight into the behaviour of the optimised system and empirically evaluate SPDT. Results show that we can diagrammatically teach complex cyclic motion patterns with accuracy.
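The objective named in the abstract, the Hausdorff distance between the modelled limit cycle and the user's sketch, can be evaluated with SciPy as below. Note that SPDT needs a differentiable (or smoothed) surrogate to optimise the diffeomorphism; SciPy's routine only evaluates the metric, and the function name here is illustrative.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(cycle_pts, sketch_pts):
    """Symmetric Hausdorff distance between a rolled-out limit cycle and the
    user's projected sketch, both given as (N, 3) point sets."""
    d_ab = directed_hausdorff(cycle_pts, sketch_pts)[0]
    d_ba = directed_hausdorff(sketch_pts, cycle_pts)[0]
    return max(d_ab, d_ba)
```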
|
|
WeET1 |
302 |
Autonomous Vehicles 2 |
Regular Session |
Chair: Ang Jr, Marcelo H | National University of Singapore |
Co-Chair: Shi, Weisong | University of Delaware |
|
16:35-16:40, Paper WeET1.1 | |
DriveSceneGen: Generating Diverse and Realistic Driving Scenarios from Scratch |
|
Sun, Shuo | National University of Singapore |
Gu, Zekai | National University of Singapore |
Sun, Tianchen | National University of Singapore |
Sun, Jiawei | National University of Singapore |
Yuan, Chengran | National University of Singapore |
Han, Yuhang | National University of Singapore |
Li, Dongen | National University of Singapore |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Big Data in Robotics and Automation, Simulation and Animation, Intelligent Transportation Systems
Abstract: Realistic and diverse traffic scenarios in large quantities are crucial for the development and validation of autonomous driving systems. However, owing to numerous difficulties in the data collection process and the reliance on intensive annotations, real-world datasets lack sufficient quantity and diversity to support the increasing demand for data. This work introduces DriveSceneGen, a data-driven driving scenario generation method that learns from the real-world driving dataset and generates entire dynamic driving scenarios from scratch. Experimental results on 5k generated scenarios highlight that DriveSceneGen is able to generate novel driving scenarios that align with real-world data distributions with high fidelity and diversity. To the best of our knowledge, DriveSceneGen is the first method that generates novel driving scenarios involving both static map elements and dynamic traffic participants from scratch. Extensive experiments demonstrate that our two-stage method outperforms existing state-of-the-art map generation methods and trajectory simulation methods on their respective tasks.
|
|
16:40-16:45, Paper WeET1.2 | |
AMVP: Adaptive Multi-Volume Primitives for Auto-Driving Novel View Synthesis |
|
Qi, Dexin | Xi'an Jiaotong University |
Tao, Tao | Xi'an Jiaotong University |
Zhang, Zhihong | Xi'an Jiaotong University |
Mei, Xuesong | Xi'an Jiaotong University |
Keywords: Deep Learning Methods, Visual Learning
Abstract: Synthesizing high-quality novel views is critical to extending training data for auto-driving scenes. However, existing novel view synthesis techniques rely on a single-volume radiance field with uniform spatial resolution, constraining their model capacity and resulting in artifacts in synthesized auto-driving views. This paper introduces AMVP, a novel neural representation that models auto-driving scenes using multiple local primitives with adaptive spatial resolution. AMVP addresses the lack of representation capability of detail-rich regions by adaptively subdividing the scene into multiple local volumes. Each local volume is assigned a tailored resolution based on its geometric complexity, as determined by a density prior. Subsequently, multi-volume primitives are introduced to enable sharing a global feature table among local volumes, addressing the GPU memory inefficiency caused by the duplicated allocation. In addition, the paper proposes resolution-aware confidence, a mechanism that suppresses artifacts arising from frequency ambiguity. This mechanism adaptively reduces high-frequency components based on the spatial resolution of each local volume and the distance of the sampling point from the optical center. Experimental results on benchmark auto-driving datasets demonstrate that the proposed AMVP achieves superior rendering quality while using a similar number of parameters compared to existing methods.
|
|
16:45-16:50, Paper WeET1.3 | |
EMATO: Energy-Model-Aware Trajectory Optimization for Autonomous Driving |
|
Tian, Zhaofeng | University of Delaware |
Xia, Lichen | University of Delaware |
Shi, Weisong | University of Delaware |
Keywords: Energy and Environment-Aware Automation, Autonomous Vehicle Navigation, Motion and Path Planning
Abstract: Autonomous driving currently lacks robust evidence of energy efficiency when using energy-model-agnostic trajectory planning. To address this, we explore how differential energy models can be effectively utilized under varying driving conditions to enhance energy efficiency. Furthermore, we propose an online nonlinear programming approach that optimizes polynomial trajectories generated by the Frenet polynomial method while incorporating traffic trajectory data and road slope predictions. Through case studies, quantitative analyses, and ablation studies conducted on both sedan and truck models, we demonstrate the effectiveness of the proposed method.
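For context on the trajectories being optimised, the Frenet polynomial method generates candidates such as the quintic below, whose six coefficients satisfy position, velocity, and acceleration boundary conditions at both ends of a segment. EMATO then refines such candidates with an energy-model-aware nonlinear program; this sketch covers only the polynomial-generation step and uses illustrative names.

```python
import numpy as np

def quintic_coeffs(s0, v0, a0, sT, vT, aT, T):
    """Coefficients c0..c5 of s(t) = sum c_k t^k meeting boundary conditions at 0 and T."""
    A = np.array([
        [T**3,     T**4,      T**5],
        [3 * T**2, 4 * T**3,  5 * T**4],
        [6 * T,    12 * T**2, 20 * T**3],
    ])
    b = np.array([
        sT - (s0 + v0 * T + 0.5 * a0 * T**2),   # position residual at T
        vT - (v0 + a0 * T),                     # velocity residual at T
        aT - a0,                                # acceleration residual at T
    ])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([s0, v0, 0.5 * a0, c3, c4, c5])
```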
|
|
16:50-16:55, Paper WeET1.4 | |
Task-Oriented Pre-Training for Drivable Area Detection |
|
Ma, Fulong | The Hong Kong University of Science and Technology |
Zhao, Guoyang | HKUST(GZ) |
Qi, Weiqing | HKUST |
Liu, Ming | Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Intelligent Transportation Systems, Object Detection, Segmentation and Categorization, Semantic Scene Understanding
Abstract: Pre-training techniques play a crucial role in deep learning, enhancing models' performance across a variety of tasks. By initially training on large datasets and subsequently fine-tuning on task-specific data, pre-training provides a solid foundation for models, improving generalization abilities and accelerating convergence rates. This approach has seen significant success in the fields of natural language processing and computer vision. However, traditional pre-training methods necessitate large datasets and substantial computational resources, and they can only learn shared features through prolonged training and struggle to capture deeper, task-specific features. In this paper, we propose a task-oriented pre-training method that begins with generating redundant segmentation proposals using the Segment Anything (SAM) model. We then introduce a Specific Category Enhancement Fine-tuning (SCEF) strategy for fine-tuning the Contrastive Language-Image Pre-training (CLIP) model to select proposals most closely related to the drivable area from those generated by SAM. This approach can generate a large amount of coarse training data for pre-training models, which are further fine-tuned using manually annotated data, thereby improving the model's performance. Comprehensive experiments conducted on the KITTI road dataset demonstrate that our task-oriented pre-training method achieves an all-around performance improvement compared to models without pre-training. Moreover, our pre-training method not only surpasses traditional pre-training approaches but also achieves the best performance compared to state-of-the-art self-training methods.
|
|
16:55-17:00, Paper WeET1.5 | |
UA-PnP: Uncertainty-Aware End-To-End Bird's Eye View Visual Perception and Prediction for Autonomous Driving |
|
Huang, Zijian | Southern University of Science and Technology |
Li, Dachuan | Southern University of Science and Technology |
Hao, Qi | Southern University of Science and Technology |
Keywords: Intelligent Transportation Systems, Computer Vision for Transportation
Abstract: Robust and accurate perception and prediction of the driving scenarios are crucial for autonomous driving vehicles (ADV). State-of-the-art ADV frameworks have evolved from conventional modular design to an end-to-end (E2E) pipeline that enables joint feature learning and optimization. However, the evaluation of uncertainties in the intermediate features propagated between perception and prediction units is missing in current E2E pipelines. Consequently, adverse and extreme environment factors may incur highly untrustworthy features that ultimately result in degraded perception and prediction. In this work, we propose a novel uncertainty-aware E2E visual perception and prediction framework that utilizes Bird's Eye View (BEV) representations. A feature distribution estimation network is introduced to explicitly quantify the uncertainties in the intermediate BEV features extracted from the images. To better exploit temporal information and generate more robust features for scene prediction, an uncertainty-aware transformer is designed to utilize the guidance of the quantified feature uncertainty via the attention mechanism. In addition, an evidential decoder generates accurate future instance segmentations along with the associated uncertainties. Comprehensive experiments conducted on a real-world dataset validate the superiority of our proposed framework over conventional pipelines. Codes are available at: https://github.com/Huang121381/UA-PnP.
|
|
17:00-17:05, Paper WeET1.6 | |
HGAT-CP: Heterogeneous Graph Attention Network for Collision Prediction in Autonomous Driving |
|
Jiang, Yongzhi | Beihang University |
Zhou, Bin | Beihang University |
Li, Yongwei | Beihang University |
Wu, Xinkai | Beihang University |
Xiong, Zhongxia | Beihang University |
Keywords: Intelligent Transportation Systems, Collision Avoidance, Autonomous Vehicle Navigation
Abstract: Predicting potential collision events is beneficial to ensure the driving safety of autonomous vehicles. Existing graph-based collision prediction methods rely heavily on domain knowledge and predefined semantic relations, limiting their flexibility and adaptability in complex driving scenarios. To overcome these challenges, this paper introduces a novel collision prediction framework named HGAT-CP, which integrates a Heterogeneous Graph Attention Network (HGAT) with a Long Short-Term Memory network (LSTM) to model the spatial-temporal interactions in scenes. First, the proposed method employs a data-driven scene graph embedding module to autonomously learn relationships between vehicles and lanes and construct flexible scene graphs. Then, the HGAT module utilizes a dual-level attention mechanism, operating at both the node level and type level, to capture spatial interactions without relying on predefined semantic rules. The LSTM module models temporal dependencies of the scene graph embeddings to improve the prediction of collision events over time. Experimental evaluations on public datasets demonstrate that our proposed method achieves state-of-the-art performance, outperforming existing methods across all metrics.
|
|
17:05-17:10, Paper WeET1.7 | |
SE-STDGNN: A Self-Evolving Spatial-Temporal Directed Graph Neural Network for Multi-Vehicle Trajectory Prediction |
|
Guo, Zixuan | The Chinese University of Hong Kong |
Han, Bingxin | The Chinese University of Hong Kong |
Huang, Yijun | The Chinese University of Hong Kong |
Chen, Xi | The Chinese University of Hong Kong |
Chen, Ben M. | Chinese University of Hong Kong |
Keywords: Intelligent Transportation Systems, Deep Learning Methods, Automation Technologies for Smart Cities
Abstract: Vehicle trajectory prediction (VTP) is essential for microscopic traffic risk assessment, autonomous vehicle navigation, and traffic behavior analysis. Related research leveraging learning-based methodologies has yielded notable success on various benchmark trajectory datasets. However, these models often experience performance degradation when faced with dynamic changes in traffic conditions such as vehicle density, road types, and weather conditions, as they have not been exposed to these variations during the training process. To effectively address the need for real-time adaptation in dynamic traffic scenarios, we propose a novel framework titled self-evolving spatial-temporal directed graph neural network (SE-STDGNN). This model utilizes evolving graph convolution networks (EvolveGCNs) to aggregate spatial-temporal features of vehicles and their neighbors, which are then utilized by a trajectory prediction module to forecast future trajectories. Further, a self-evolving mechanism is introduced to adjust model parameters dynamically during real-time operation. The efficacy of SE-STDGNN is validated using the public vehicle trajectory dataset AD4CHE.
|
|
17:10-17:15, Paper WeET1.8 | |
A Generalized Control Revision Method for Autonomous Driving Safety |
|
Zhu, Zehang | Tsinghua University |
Wang, Yuning | Tsinghua University |
Ke, Tianqi | School of Vehicle and Mobility, Tsinghua University |
Han, Zeyu | Tsinghua University |
Xu, Shaobing | Tsinghua University |
Xu, Qing | Tsinghua University |
Dolan, John M. | Carnegie Mellon University |
Wang, Jianqiang | Tsinghua University |
Keywords: Intelligent Transportation Systems, Robot Safety, Collision Avoidance
Abstract: Safety is one of the most crucial challenges of autonomous driving vehicles, and one solution to guarantee safety is to employ an additional control revision module after the planning backbone. The Control Barrier Function (CBF) has been widely used because of its strong mathematical foundation on safety. However, the incompatibility with heterogeneous perception data and incomplete consideration of traffic scene elements make existing systems difficult to apply in dynamic and complex real-world scenarios. In this study, we introduce a generalized control revision method for autonomous driving safety, which adopts both vectorized perception and occupancy grid maps as inputs and comprehensively models multiple types of traffic scene constraints based on a newly proposed barrier function. Traffic elements are integrated into one unified framework, decoupled from specific scenario settings or rules. Experiments on the CARLA, SUMO, and OnSite simulators show that the proposed algorithm can realize safe control revision under complicated scenes, adapting to various planning backbones, road topologies, and risk types. Physical platform validation also verifies its real-world application feasibility.
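As background on the control-revision idea, the sketch below shows the standard single-constraint control barrier function filter in closed form: keep the planner's command unless it violates the barrier condition, in which case project it onto the safe half-space. This is the textbook CBF quadratic program, not the paper's generalized barrier function; all names and numbers are illustrative.

import numpy as np

def cbf_revise(u_nom, lg_h, lf_h, h, alpha=1.0):
    # Enforce lf_h + lg_h @ u + alpha * h >= 0 while staying as close as
    # possible to the nominal command u_nom (one affine constraint, so the
    # QP reduces to a projection onto a half-space).
    a = np.asarray(lg_h, dtype=float)
    residual = lf_h + a @ u_nom + alpha * h
    if residual >= 0.0 or np.allclose(a, 0.0):
        return u_nom  # nominal command already satisfies the barrier condition
    return u_nom - (residual / (a @ a)) * a  # minimal revision onto the boundary

# Illustrative call: the nominal command violates the condition and gets revised.
u_safe = cbf_revise(u_nom=np.array([2.0, 0.3]), lg_h=np.array([1.0, 0.0]),
                    lf_h=-3.0, h=0.2)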
|
|
WeET2 |
301 |
Learning-Based SLAM 2 |
Regular Session |
Chair: Kim, Donghyun | University of Massachusetts Amherst |
|
16:35-16:40, Paper WeET2.1 | |
H3-Mapping: Quasi-Heterogeneous Feature Grids for Real-Time Dense Mapping Using Hierarchical Hybrid Representation |
|
Jiang, Chenxing | The Hong Kong University of Science and Technology |
Luo, Yiming | The University of Hong Kong |
Zhou, Boyu | Southern University of Science and Technology |
Shen, Shaojie | Hong Kong University of Science and Technology |
Keywords: Mapping, RGB-D Perception, Visual Learning
Abstract: In recent years, implicit online dense mapping methods have achieved high-quality reconstruction results, showcasing great potential in robotics, AR/VR, and digital twins applications. However, existing methods struggle with slow texture modeling which limits their real-time performance. To address these limitations, we propose a NeRF-based dense mapping method that enables faster and higher-quality reconstruction. To improve texture modeling, we introduce quasi-heterogeneous feature grids, which inherit the fast querying ability of uniform feature grids while adapting to varying levels of texture complexity. Besides, we present a gradient-aided coverage-maximizing strategy for keyframe selection that enables the selected keyframes to exhibit a closer focus on rich-textured regions and a broader scope for weak-textured areas. Experimental results demonstrate that our method surpasses existing NeRF-based approaches in texture fidelity, geometry accuracy, and time consumption. The code for our method will be available at: https://github.com/SYSU-STAR/H3-Mapping.
|
|
16:40-16:45, Paper WeET2.2 | |
CEAR: Comprehensive Event Camera Dataset for Rapid Perception of Agile Quadruped Robots |
|
Zhu, Shifan | University of Massachusetts Amherst |
Xiong, Zixun | University of Massachusetts Amherst |
Kim, Donghyun | University of Massachusetts Amherst |
Keywords: Data Sets for SLAM, Data Sets for Robotic Vision, Legged Robots
Abstract: When legged robots perform agile movements, traditional RGB cameras often produce blurred images, posing a challenge for rapid perception. Event cameras have emerged as a promising solution for capturing rapid perception and coping with challenging lighting conditions thanks to their low latency, high temporal resolution, and high dynamic range. However, integrating event cameras into agile-legged robots is still largely unexplored. Notably, no dataset including event cameras has yet been developed for the context of agile quadruped robots. To bridge this gap, we introduce CEAR, a dataset comprising data from an event camera, an RGB-D camera, an IMU, a LiDAR, and joint encoders, all mounted on a dynamic quadruped, Mini Cheetah robot. This comprehensive dataset features more than 100 sequences from real-world environments, encompassing various indoor and outdoor environments, different lighting conditions, a range of robot gaits (e.g., trotting, bounding, pronking), as well as acrobatic movements like backflip. To our knowledge, this is the first event camera dataset capturing the dynamic and diverse quadruped robot motions under various setups, developed to advance research in rapid perception for quadruped robots.
|
|
16:45-16:50, Paper WeET2.3 | |
DVLO4D: Deep Visual-Lidar Odometry with Sparse Spatial-Temporal Fusion |
|
Liu, Mengmeng | University of Twente |
Yang, Michael Ying | University of Bath |
Liu, Jiuming | Shanghai Jiao Tong University |
Zhang, Yunpeng | PhiGent Robotics |
Li, Jiangtao | Phigent Robotics |
Sander, Oude Elberink | University of Twente |
Vosselman, George | University of Twente |
Cheng, Hao | University of Twente |
Keywords: Localization, Autonomous Agents, SLAM
Abstract: Visual-LiDAR odometry is a critical component for autonomous system localization, yet achieving high accuracy and strong robustness remains a challenge. Traditional approaches commonly struggle with sensor misalignment, fail to fully leverage temporal information, and require extensive manual tuning to handle diverse sensor configurations. To address these problems, we introduce DVLO4D, a novel visual-LiDAR odometry framework that leverages sparse spatial-temporal fusion to enhance accuracy and robustness. Our approach proposes three key innovations: (1) Sparse Query Fusion, which utilizes sparse LiDAR queries for effective multi-modal data fusion; (2) a Temporal Interaction and Update module that integrates temporally predicted positions with current frame data, providing better initialization values for pose estimation and enhancing the model's robustness against accumulated errors; and (3) a Temporal Clip Training strategy combined with a Collective Average Loss mechanism that aggregates losses across multiple frames, enabling global optimization and reducing scale drift over long sequences. Extensive experiments on the KITTI and Argoverse odometry datasets demonstrate the superiority of the proposed DVLO4D, which achieves state-of-the-art performance in terms of both pose accuracy and robustness. Additionally, our method is highly efficient, with an inference time of 82 ms, making it well suited for real-time deployment.
|
|
16:50-16:55, Paper WeET2.4 | |
Hier-SLAM: Scaling-Up Semantics in SLAM with a Hierarchically Categorical Gaussian Splatting |
|
Li, Boying | Shanghai Jiao Tong University |
Cai, Zhixi | Monash University |
Li, Yuan-Fang | Monash University |
Reid, Ian | University of Adelaide |
Rezatofighi, Hamid | Monash University |
Keywords: SLAM, Semantic Scene Understanding, Deep Learning for Visual Perception
Abstract: We propose Hier-SLAM, a semantic 3D Gaussian Splatting SLAM method featuring a novel hierarchical categorical representation, which enables accurate global 3D semantic mapping, scaling-up capability, and explicit semantic label prediction in the 3D world. The parameter usage in semantic SLAM systems increases significantly with the growing complexity of the environment, making it particularly challenging and costly for scene understanding. To address this problem, we introduce a novel hierarchical representation that encodes semantic information in a compact form into 3D Gaussian Splatting, leveraging the capabilities of large language models (LLMs). We further introduce a novel semantic loss designed to optimize hierarchical semantic information through both inter-level and cross-level optimization. Furthermore, we enhance the whole SLAM system, resulting in improved tracking and mapping performance. Our Hier-SLAM outperforms existing dense SLAM methods in both mapping and tracking accuracy, while achieving a 2x operation speed-up. Additionally, it achieves on-par semantic rendering performance compared to existing methods while significantly reducing storage and training time requirements. Rendering FPS impressively reaches 2,000 with semantic information and 3,000 without it. Most notably, it showcases the capability of handling the complex real-world scene with more than 500 semantic classes, highlighting its valuable scaling-up capability. The open-source code is available at https://github.com/LeeBY68/Hier-SLAM.
|
|
16:55-17:00, Paper WeET2.5 | |
CLIP-Clique: Graph-Based Correspondence Matching Augmented by Vision Language Models for Object-Based Global Localization |
|
Matsuzaki, Shigemichi | Toyota Motor Corporation |
Tanaka, Kazuhito | Toyota Motor Corporation |
Shintani, Kazuhiro | Toyota Motor Corporation |
Keywords: Localization, Semantic Scene Understanding, RGB-D Perception
Abstract: This paper proposes a method of global localization on a map with semantic object landmarks. One of the most promising approaches for localization on object maps is to use semantic graph matching using landmark descriptors calculated from the distribution of surrounding objects. These descriptors are vulnerable to misclassification and partial observations. Moreover, many existing methods rely on inlier extraction using RANSAC, which is stochastic and prone to a high outlier rate. To address the former issue, we augment the correspondence matching using Vision Language Models (VLMs). Landmark discriminability is improved by VLM embeddings, which are independent of surrounding objects. In addition, inliers are estimated deterministically using a graph-theoretic approach. We also incorporate pose calculation using the weighted least squares considering correspondence similarity and observation completeness to improve the robustness. We confirmed improvements in matching and pose estimation accuracy through experiments on ScanNet and TUM datasets.
|
|
17:00-17:05, Paper WeET2.6 | |
CLOi-Mapper: Consistent, Lightweight, Robust, and Incremental Mapper with Embedded Systems for Commercial Robot Services |
|
Noh, DongKi | LG Electronics Inc |
Lim, Hyungtae | Massachusetts Institute of Technology |
Eoh, Gyuho | Tech University of Korea |
Choi, Duckyu | KAIST |
Choi, Jeong-Sik | Seoul National University |
Lim, Hyunjun | Electronics and Telecommunication Research Institute |
Baek, Seung-Min | LG Electronics |
Myung, Hyun | KAIST (Korea Advanced Institute of Science and Technology) |
Keywords: Service Robotics, Embedded Systems for Robotic and Automation, Mapping
Abstract: In commercial autonomous service robots with several form factors, simultaneous localization and mapping (SLAM) is an essential technology for providing proper services such as cleaning and guidance. Such robots require SLAM algorithms suitable for specific applications and environments. Hence, several SLAM frameworks have been proposed to address various requirements in the past decade. However, we have encountered challenges in implementing recent innovative frameworks when handling service robots with low-end processors and insufficient sensor data, such as low-resolution 2D LiDAR sensors. Specifically, regarding commercial robots, consistent performance in different hardware configurations and environments is more crucial than the performance dedicated to specific sensors or environments. Therefore, we propose a) a multi-stage approach for global pose estimation in embedded systems; b) a graph generation method with zero constraints for synchronized sensors; and c) a robust and memory-efficient method for long-term pose-graph optimization. As verified in in-home and large-scale indoor environments, the proposed method yields consistent global pose estimation for services in commercial fields. Furthermore, the proposed method exhibits potential commercial viability considering the consistent performance verified via mass production and long-term (> 5 years) operation.
|
|
17:05-17:10, Paper WeET2.7 | |
D2S: Representing Sparse Descriptors and 3D Coordinates for Camera Relocalization |
|
Bui, Bach-Thuan | Ritsumeikan University |
Bui, Huy Hoang | Ritsumeikan University |
Tran, Dinh Tuan | College of Information Science and Engineering, Ritsumeikan Univ |
Lee, Joo-Ho | Ritsumeikan University |
Keywords: Localization, Mapping, Vision-Based Navigation
Abstract: State-of-the-art visual localization methods mostly rely on complex procedures to match local descriptors and 3D point clouds. However, these procedures can incur significant costs in terms of inference, storage, and updates over time. In this study, we propose a direct learning-based approach that utilizes a simple network named D2S to represent complex local descriptors and their scene coordinates. Our method is characterized by its simplicity and cost-effectiveness. It solely leverages a single RGB image for localization during the testing phase and only requires a lightweight model to encode a complex sparse scene. The proposed D2S employs a combination of a simple loss function and graph attention to selectively focus on robust descriptors while disregarding areas such as clouds, trees, and several dynamic objects. This selective attention enables D2S to effectively perform a binary-semantic classification for sparse descriptors. Additionally, we propose a simple outdoor dataset to evaluate the capabilities of visual localization methods in scene-specific generalization and self-updating from unlabeled observations. Our approach outperforms the previous regression-based methods in both indoor and outdoor environments. It demonstrates the ability to generalize beyond training data, including scenarios involving transitions from day to night and adapting to domain shifts. The source code, trained models, dataset, and demo videos are available at the following link: https://thpjp.github.io/d2s
|
|
WeET3 |
303 |
Offroad Navigation |
Regular Session |
|
16:35-16:40, Paper WeET3.1 | |
CAHSOR: Competence-Aware High-Speed Off-Road Ground Navigation in SE(3) |
|
Pokhrel, Anuj | George Mason University |
Nazeri, Mohammad | George Mason University |
Datar, Aniket | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Autonomous Vehicle Navigation, Representation Learning, Field Robots
Abstract: While the workspace of traditional ground vehicles is usually assumed to be in a 2D plane, i.e., SE(2), such an assumption may not hold when they drive at high speeds on unstructured off-road terrain: High-speed sharp turns on high-friction surfaces may lead to vehicle rollover; Turning aggressively on loose gravel or grass may violate the non-holonomic constraint and cause significant lateral sliding; Driving quickly on rugged terrain will produce extensive vibration along the vertical axis. Therefore, most off-road vehicles are currently limited to driving only at low speeds to assure vehicle stability and safety. In this work, we aim at empowering high-speed off-road vehicles with competence awareness in SE(3) so that they can reason about the consequences of taking aggressive maneuvers on different terrain with a 6-DoF forward kinodynamic model. The kinodynamic model is learned from visual, speed, and inertial Terrain Representation for Off-road Navigation (TRON) using multimodal, self-supervised vehicle-terrain interactions. We demonstrate the efficacy of our Competence-Aware High-Speed Off-Road (CAHSOR) navigation approach on a physical ground robot in both autonomous navigation and a human shared-control setup and show that CAHSOR can efficiently reduce vehicle instability by 62% while only compromising 8.6% average speed with the help of TRON.
|
|
16:40-16:45, Paper WeET3.2 | |
ROD: RGB-Only Fast and Efficient Off-Road Freespace Detection |
|
Sun, Tong | University of Chinese Academy of Sciences |
Ye, Hongliang | Zhejiang Lab |
Mei, Jilin | Institute of Computing Technology, Chinese Academy of Sciences |
Chen, Liang | Institute of Computing Technology: Beijing, CN |
Zhao, Fangzhou | Institute of Computing Technology, Chinese Academy of Sciences |
Zong, Leiqiang | Beijing Special Vehicle Academy |
Hu, Yu | Institute of Computing Technology Chinese Academy of Sciences |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception
Abstract: Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multi-modal methods are not suitable for real-time applications, particularly in real-world scenarios that require a higher FPS than slow navigation. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder; together these components improve both precision and inference speed. ROD establishes a new SOTA on the ORFD and RELLIS-3D datasets while achieving an inference speed of 50 FPS, significantly outperforming prior models. Our code will be available at https://github.com/STLIFE97/offroad_roadseg.
|
|
16:45-16:50, Paper WeET3.3 | |
JORD: A Benchmark Dataset for Off-Road LiDAR Place Recognition and SLAM |
|
Zhou, Wei | Jilin University |
Zhang, Tongzhou | Jilin University |
Xu, Qian | China North Vehicle Research Institute |
Chen, Yu | Jilin University |
Hou, Minghui | Jilin University |
Wang, Gang | Jilin University |
Keywords: Data Sets for SLAM, SLAM, Mapping
Abstract: Simultaneous localization and mapping (SLAM) is a crucial component of unmanned systems, playing a key role in autonomous navigation. Currently, most LiDAR SLAM methods are focused on structured environments. However, highly irregular off-road terrain poses more challenges for LiDAR SLAM tasks, but these environments are not fully represented in existing datasets. To address this issue, we introduce the first dedicated LiDAR SLAM benchmark dataset for off-road environments, named the Jlurobot Off-Road Dataset (JORD). This dataset is collected using a custom avenger data collection platform in large-scale forest off-road scenes, consisting of 8 LiDAR sequences with a total length of approximately 6.07 kilometers, containing 49,144 point cloud frames along with accurate 6DoF ground truth. The dataset includes multiple revisits within the sequences, making it suitable for LiDAR place recognition and SLAM tasks. Furthermore, we employ several state-of-the-art methods for benchmarking to validate the dataset's challenges. The release of JORD aims to provide researchers with valuable resources to develop new approaches and explore novel directions for unmanned systems in off-road environments. The complete dataset and code are available at https://github.com/jiurobots/JORD.
|
|
16:50-16:55, Paper WeET3.4 | |
Self-Reflective Perceptual Adaptation for Robust Ground Navigation in Unstructured Off-Road Environments |
|
Siva, Sriram | US Army DEVCOM Army Research Laboratory |
Youngquist, Oscar | University of Massachusetts Amherst |
Wigness, Maggie | U.S. Army Research Laboratory |
Rogers III, John G. | US Army Research Laboratory |
Zhang, Hao | University of Massachusetts Amherst |
Keywords: Vision-Based Navigation, Field Robots, Deep Learning Methods
Abstract: Autonomous ground robots navigating unstructured off-road environments face perceptual challenges, such as sensor obscuration or failure, which can lead to inaccurate perception or navigation failures. While robot adaptation has recently gained increasing attention, self-reflective robot adaptation, where robots understand and adjust to their own sensor limitations, remains under-explored. This paper proposes a novel approach for self-reflective perceptual adaptation in order to enhance robust off-road navigation. Our approach enables a robot to identify its own perceptual difficulties and dynamically adapt in challenging environments. The key novelty is learning a modality-invariant perceptual representation that encodes shared sensor data into a compact feature space. Within this representation space, the robot's dynamics model is also learned, which enables accurate prediction of future navigation paths. Extensive experiments in off-road environments with sensor obstructions and failures demonstrate that our method significantly improves adaptive capabilities and outperforms baseline and state-of-the-art approaches.
|
|
16:55-17:00, Paper WeET3.5 | |
Dynamics Modeling Using Visual Terrain Features for High-Speed Autonomous Off-Road Driving |
|
Gibson, Jason | Georgia Institute of Technology |
Alavilli, Anoushka | Carnegie Mellon University |
Tevere, Erica | Jet Propulsion Laboratory, California Institute of Technology |
Theodorou, Evangelos | Georgia Institute of Technology |
Spieler, Patrick | JPL |
Keywords: Integrated Planning and Learning, Machine Learning for Robot Control, Motion and Path Planning
Abstract: Rapid autonomous traversal of unstructured terrain is essential for scenarios such as disaster response, search and rescue, or planetary exploration. As a vehicle navigates at the limit of its capabilities over extreme terrain, its dynamics can change suddenly and dramatically. For example, high-speed and varying terrain can affect parameters such as traction, tire slip, and rolling resistance. To achieve effective planning in such environments, it is crucial to have a dynamics model that can accurately anticipate these conditions. In this work, we present a hybrid model that predicts the changing dynamics induced by the terrain as a function of visual inputs. We leverage a pre-trained visual foundation model (VFM) such as DINOv2, which provides rich features that encode fine-grained semantic information. To use this dynamics model for planning, we propose an end-to-end training architecture for a projection-distance-independent feature encoder that compresses the information from the VFM, enabling the creation of a lightweight map of the environment at runtime. We validate our architecture on an extensive dataset (hundreds of kilometers of aggressive off-road driving) collected across multiple locations as part of the DARPA Robotic Autonomy in Complex Environments with Resiliency (RACER) program.
|
|
17:00-17:05, Paper WeET3.6 | |
Digital Twins Meet the Koopman Operator: Data-Driven Learning for Robust Autonomy |
|
Samak, Chinmay | Clemson University International Center for Automotive Research |
Samak, Tanmay | Clemson University International Center for Automotive Research |
Joglekar, Ajinkya | Clemson University |
Vaidya, Umesh | Clemson University |
Krovi, Venkat | Clemson University |
Keywords: Autonomous Vehicle Navigation, Model Learning for Control, Simulation and Animation
Abstract: Contrary to on-road autonomous navigation, off-road autonomy is complicated by various factors ranging from sensing challenges to terrain variability. In such a milieu, data-driven approaches have been commonly employed to capture intricate vehicle-environment interactions effectively. However, the success of data-driven methods depends crucially on the quality and quantity of data, which can be compromised by large variability in off-road environments. To address these concerns, we present a novel methodology to recreate the exact vehicle and its target operating conditions digitally for domain-specific data generation. This enables us to effectively model off-road vehicle dynamics from simulation data using the Koopman operator theory, and employ the obtained models for local motion planning and optimal vehicle control. The capabilities of the proposed methodology are demonstrated through an autonomous navigation problem of a 1:5 scale vehicle, where a terrain-informed planner is employed for global mission planning. Results indicate a substantial improvement in off-road navigation performance with the proposed algorithm (5.84x) and underscore the efficacy of digital twinning in terms of improving the sample efficiency (3.2x) and reducing the sim2real gap (5.2%).
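For context on the Koopman operator modeling referenced in the abstract, a minimal extended dynamic mode decomposition (EDMD) sketch is given below: lift state snapshots with a dictionary and solve a least-squares problem for the linear Koopman matrix. The dictionary, toy system, and names are assumptions for illustration, not taken from the paper.

import numpy as np

def edmd_koopman(X, Y, lift):
    # Given snapshot pairs (X[k], Y[k]) with Y[k] = F(X[k]), fit K so that
    # lift(Y[k]) is approximately K @ lift(X[k]) in a least-squares sense.
    PX = np.column_stack([lift(x) for x in X])  # lifted states, shape (d, N)
    PY = np.column_stack([lift(y) for y in Y])
    A, *_ = np.linalg.lstsq(PX.T, PY.T, rcond=None)
    return A.T

# Toy example: scalar system x_{k+1} = 0.9 x_k with a quadratic dictionary.
lift = lambda x: np.array([1.0, x[0], x[0]**2])
X = [np.array([v]) for v in np.linspace(-1.0, 1.0, 20)]
Y = [0.9 * x for x in X]
K = edmd_koopman(X, Y, lift)  # 3x3 linear predictor in the lifted space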
|
|
17:05-17:10, Paper WeET3.7 | |
Off-Road Freespace Detection with LiDAR-Camera Fusion and Self-Distillation |
|
Gu, Shuo | Nanjing University of Science and Technology |
Gao, Ming | Nanjing University of Science and Technology |
Keywords: Intelligent Transportation Systems, Semantic Scene Understanding, Sensor Fusion
Abstract: LiDAR-camera fusion has gradually become the mainstream for freespace detection in unstructured off-road environments. However, existing methods mainly use traditional techniques to densify the sparse LiDAR data in the perspective view, which introduces noise and limits the representation ability. In this paper, we propose a lightweight end-to-end freespace detection network with cascaded LiDAR-camera fusion and multi-scale self-distillation. It first performs sparse freespace detection in the range view, and then projects the range-view features onto the perspective view and densifies them. The dense features obtained are fused with camera images to get the final freespace detection results. In our method, the cascaded fusion strategy reduces the impact of resolution differences between LiDAR point clouds and camera images, as well as the introduction of noise during the data densification process. The multi-scale self-distillation strategy distills knowledge from the LiDAR-camera fusion module to the perspective-view module to further improve the freespace detection performance using LiDAR data only. Experiments on the off-road ORFD dataset demonstrate the effectiveness of the proposed cascaded fusion and multi-scale self-distillation strategies; our method obtains 93.4% IoU at speeds of more than 50 Hz. It also achieves state-of-the-art performance among all LiDAR-based freespace detection methods.
|
|
17:10-17:15, Paper WeET3.8 | |
Learning to Model and Plan for Wheeled Mobility on Vertically Challenging Terrain |
|
Datar, Aniket | George Mason University |
Pan, Chenhui | George Mason University |
Xiao, Xuesu | George Mason University |
Keywords: Autonomous Vehicle Navigation, Motion and Path Planning, Model Learning for Control
Abstract: Most autonomous navigation systems assume wheeled robots are rigid bodies and their 2D planar workspaces can be divided into free spaces and obstacles. However, recent wheeled mobility research, showing that wheeled platforms have the potential of moving over vertically challenging terrain (e.g., rocky outcroppings, rugged boulders, and fallen tree trunks), invalidate both assumptions. Navigating off-road vehicle chassis with long suspension travel and low tire pressure in places where the boundary between obstacles and free spaces is blurry requires precise 3D modeling of the interaction between the chassis and the terrain, which is complicated by suspension and tire deformation, varying tire-terrain friction, vehicle weight distribution and momentum, etc. In this paper, we present a learning approach to model wheeled mobility, i.e., in terms of vehicle-terrain forward dynamics, and plan feasible, stable, and efficient motion to drive over vertically challenging terrain without rolling over or getting stuck. We present physical experiments on two wheeled robots and show that planning using our learned model can achieve up to 60% improvement in navigation success rate and 46% reduction in unstable chassis roll and pitch angles.
|
|
WeET4 |
304 |
Sensor Fusion 4 |
Regular Session |
Chair: Choi, Hyouk Ryeol | Sungkyunkwan University |
Co-Chair: Huang, Guoquan (Paul) | University of Delaware |
|
16:35-16:40, Paper WeET4.1 | |
Dynamic Importance-Weighted Fusion Network Based on Dynamic Convolutions for Hand Posture Recognition: A Technique Based on Red, Green, Blue Plus Depth Cameras |
|
Qi, Jing | Beihang University |
Ma, Li | Hebei University |
Yu, Yushu | Beijing Institute of Technology |
Keywords: RGB-D Perception, Human-Robot Collaboration, Object Detection, Segmentation and Categorization
Abstract: Hand posture recognition enhances human-computer interaction, with existing algorithms mainly using RGB images or depth data. However, RGB images are affected by lighting and background, while depth data struggles to capture details, reducing accuracy. To address these issues, fusing RGB images and depth data has gained attention. Traditional fusion methods use fixed modal weights, which struggle to adapt to complex modal relationships, causing performance degradation. To resolve this, we propose a Fusion module incorporating Multi-Scale Gated Extraction (MSGE) for multi-scale feature extraction and gating, Context Sensitive Dynamic Filtering (CSDF) for dynamic weight adjustment based on modal importance, and Importance Weighted Fusion (IWF) for adaptive weighting. Building on this module, this paper proposes a network that fuses RGB information and depth data, named the Dynamic Importance-Weighted Fusion Network (DIWFNet). This network utilizes a dual-branch YOLOv5 framework integrated with four Fusion modules, fully leveraging the complementary nature of RGB images and depth data. Through dynamic weight distribution and adaptive feature convolution, it precisely captures and models the complex interactions between different modalities, enhancing the accuracy and robustness of hand posture recognition. Our method has shown excellent performance on the CUG dataset, the NTU dataset, and a self-built dataset, and has been successfully applied to robots in real operational environments.
|
|
16:40-16:45, Paper WeET4.2 | |
Robust 4D Radar-Aided Inertial Navigation for Aerial Vehicles |
|
Zhu, Jinwen | Meituan Inc |
Hu, Jun | Meituan Inc |
Zhao, Xudong | Meituan Inc |
Lang, Xiaoming | Meituan |
Mao, Yinian | Meituan-Dianping Group |
Huang, Guoquan (Paul) | University of Delaware |
Keywords: SLAM, Localization
Abstract: While LiDAR and cameras are becoming ubiquitous on unmanned aerial vehicles (UAVs), they can be ineffective in challenging environments; 4D millimeter-wave (MMW) radars, which can provide robust 3D ranging and Doppler velocity measurements, remain less exploited for aerial navigation. In this paper, we develop an efficient and robust error-state Kalman filter (ESKF)-based radar-inertial navigation system for UAVs. The key idea of the proposed approach is point-to-distribution radar scan matching, which provides motion constraints with proper uncertainty quantification; these constraints are used to update the navigation states in a tightly coupled manner, along with the Doppler velocity measurements. Moreover, we propose a robust keyframe-based matching scheme against the prior map to bound the cumulative navigation errors and provide a radar-based global localization solution with high accuracy. Extensive real-world experimental validations have demonstrated that the proposed radar-aided inertial navigation outperforms state-of-the-art methods in both accuracy and robustness.
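To illustrate how a single Doppler (radial velocity) return can constrain a filter's velocity state, here is a generic scalar Kalman-style update in Python. The sign convention, noise level, and function names are assumptions for illustration and do not reproduce the authors' ESKF formulation.

import numpy as np

def doppler_update(v_est, P, p_point, v_radial, sigma=0.2):
    # A radar return at position p_point (sensor frame) measures the radial
    # speed, modeled here as the projection of the sensor velocity onto the
    # negative unit direction of the point (sign convention assumed).
    d = -p_point / np.linalg.norm(p_point)
    H = d.reshape(1, 3)                      # 1x3 measurement Jacobian
    S = float(H @ P @ H.T) + sigma**2        # innovation variance
    K = (P @ H.T) / S                        # 3x1 Kalman gain
    v_new = v_est + (K * (v_radial - float(H @ v_est))).ravel()
    P_new = (np.eye(3) - K @ H) @ P
    return v_new, P_new

# Illustrative call with assumed numbers.
v, P = doppler_update(np.array([1.0, 0.0, 0.0]), 0.5 * np.eye(3),
                      np.array([5.0, 1.0, 0.2]), v_radial=-0.9)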
|
|
16:45-16:50, Paper WeET4.3 | |
Semi-Elastic LiDAR-Inertial Odometry |
|
Yuan, Zikang | Huazhong University, Wuhan, 430073, China |
Lang, Fengtian | Huazhong University of Science and Technology |
Xu, Tianle | Huazhong University of Science and Technology |
Ming, Ruiye | Huazhong University of Science and Technology |
Zhao, Chengwei | Hangzhou Guochen Robot Technology Company Limited |
Yang, Xin | Huazhong University of Science and Technology |
Keywords: SLAM, Localization, Sensor Fusion
Abstract: This work proposes a semi-elastic optimization-based LiDAR-inertial state estimation method, which balances the constraints from LiDAR, the IMU, and consistency according to their unique characteristics, thereby imparting appropriate elasticity that allows the current state to be optimized toward the correct value and ensuring the accuracy, consistency, and robustness of state estimation. We incorporate the proposed LiDAR-inertial state estimation method into a self-developed optimization-based LiDAR-inertial odometry (LIO) framework. Experimental results on four public datasets demonstrate that the proposed method enhances the performance of optimization-based LiDAR-inertial state estimation. We have released the source code of this work for the benefit of the community.
|
|
16:50-16:55, Paper WeET4.4 | |
DOGE: An Extrinsic Orientation and Gyroscope Bias Estimation for Visual-Inertial Odometry Initialization |
|
Xu, Zewen | Institute of Automation, Chinese Academy of Science |
He, Yijia | TCL RayNeo |
Wei, Hao | University of Chinese Academy of Sciences |
Wu, Yihong | National Laboratory of Pattern Recognition, Institute of Automation |
Keywords: Visual-Inertial SLAM
Abstract: Most existing visual-inertial odometry (VIO) initialization methods rely on accurate pre-calibrated extrinsic parameters. However, during long-term use, irreversible structural deformation caused by temperature changes, mechanical squeezing, etc. will cause changes in extrinsic parameters, especially in the rotational part. Existing initialization methods that simultaneously estimate extrinsic parameters suffer from poor robustness, low precision, and long initialization latency due to the need for sufficient translational motion. To address these problems, we propose a novel VIO initialization method, which jointly considers extrinsic orientation and gyroscope bias within the normal epipolar constraints, achieving higher precision and better robustness without delayed rotational calibration. First, a rotation-only constraint is designed for extrinsic orientation and gyroscope bias estimation, which tightly couples gyroscope measurements and visual observations and can be solved in pure-rotation cases. Second, we propose a weighting strategy together with a failure detection strategy to enhance the precision and robustness of the estimator. Finally, we leverage Maximum A Posteriori to refine the results before enough translation parallax comes. Extensive experiments have demonstrated that our method outperforms the state-of-the-art methods in both accuracy and robustness while maintaining competitive efficiency.
|
|
16:55-17:00, Paper WeET4.5 | |
GaRLIO: Gravity Enhanced Radar-LiDAR-Inertial Odometry |
|
Noh, Chiyun | Seoul National University |
Yang, Wooseong | Seoul National University |
Jung, Minwoo | Seoul National University |
Jung, Sangwoo | Seoul National University |
Kim, Ayoung | Seoul National University |
Keywords: SLAM, Localization, Range Sensing
Abstract: Recently, gravity has been highlighted as a crucial constraint for state estimation to alleviate potential vertical drift. Existing online gravity estimation methods rely on pose estimation combined with IMU measurements, which is considered best practice when direct velocity measurements are unavailable. However, with radar sensors providing direct velocity data—a measurement not yet utilized for gravity estimation—we found a significant opportunity to improve gravity estimation accuracy substantially. GaRLIO, the proposed gravity-enhanced Radar-LiDAR-Inertial Odometry, can robustly predict gravity to reduce vertical drift while simultaneously enhancing state estimation performance using pointwise velocity measurements. Furthermore, GaRLIO ensures robustness in dynamic environments by utilizing radar to remove dynamic objects from LiDAR point clouds. Our method is validated through experiments in various environments prone to vertical drift, demonstrating superior performance compared to traditional LiDAR-Inertial Odometry methods. We make our source code publicly available to encourage further research and development. https://github.com/ChiyunNoh/GaRLIO
|
|
17:00-17:05, Paper WeET4.6 | |
AF-RLIO: Adaptive Fusion of Radar-LiDAR-Inertial Information for Robust Odometry in Challenging Environments |
|
Qian, Chenglong | Zhejiang University of Technology |
Xu, Yang | Zhejiang University |
Shi, Xiufang | Zhejiang University of Technology |
Chen, Jiming | Zhejiang University |
Li, Liang | Zhejiang Univerisity |
Keywords: SLAM, Sensor Fusion, Localization
Abstract: In robotic navigation, maintaining precise pose estimation and navigation in complex and dynamic environments is crucial. However, environmental challenges such as smoke, tunnels, and adverse weather can significantly degrade the performance of single-sensor systems like LiDAR or GPS, compromising the overall stability and safety of autonomous robots. To address these challenges, we propose AF-RLIO: an adaptive fusion approach that integrates 4D millimeter-wave radar, LiDAR, inertial measurement unit (IMU), and GPS to leverage the complementary strengths of these sensors for robust odometry estimation in complex environments. Our method consists of three key modules. Firstly, the pre-processing module utilizes radar data to assist LiDAR in removing dynamic points and determining when environmental conditions are degraded for LiDAR. Secondly, the dynamic-aware multimodal odometry selects appropriate point cloud data for scan-to-map matching and tightly couples it with the IMU using the Iterative Error State Kalman Filter. Lastly, the factor graph optimization module balances weights between odometry and GPS data, constructing a pose graph for optimization. The proposed approach has been evaluated on datasets and tested in real-world robotic environments, demonstrating its effectiveness and advantages over existing methods in challenging conditions such as smoke and tunnels. Furthermore, we open source our code at https://github.com/NeSC-IV/AF-RLIO.git to benefit the research community.
|
|
17:05-17:10, Paper WeET4.7 | |
Adaptive Measurement Model-Based Fusion of Capacitive Proximity Sensor and LiDAR for Improved Mobile Robot Perception |
|
Kang, Hyunchang | Sungkyunkwan University |
Yim, Hongsik | Sungkyunkwan University |
Sung, HyukJae | SUNGKYUNKWAN UNIVERSITY |
Choi, Hyouk Ryeol | Sungkyunkwan University |
Keywords: Sensor Fusion, Human-Robot Collaboration, Robot Safety
Abstract: This study introduces a novel algorithm that combines a custom-developed capacitive proximity sensor with LiDAR. This integration targets the limitations of using single-sensor systems for mobile robot perception. Our approach deals with the non-Gaussian distribution that arises during the nonlinear transformation of capacitive sensor data into distance measurements. The non-Gaussian distribution resulting from this nonlinear transformation is linearized using a first-order Taylor approximation, creating a measurement model unique to our sensor. This method helps establish a linear relationship between capacitance values and their corresponding distance measurements. Assuming that the capacitance’s standard deviation remains constant, it is modeled as a distance function. By linearizing the capacitance data and synthesizing it with LiDAR data using Gaussian methods, we fuse the sensor information to enhance integration. This results in more precise and robust distance measurements than those obtained through traditional Extended Kalman Filter (EKF) and Adaptive Extended Kalman Filter (AEKF) methods. The proposed algorithm is designed for real-time data processing, significantly improving the robot’s state estimation accuracy and stability in various environments. This study offers a reliable method for positional estimation of mobile robots, showcasing outstanding fusion performance in complex settings.
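As a simplified illustration of the first-order Taylor linearization and Gaussian fusion described above, the sketch below converts a capacitance reading to distance, propagates its noise through the linearized model, and fuses it with a LiDAR range by inverse-variance weighting. The inverse model d = k / c and every constant are assumptions, not the authors' calibrated measurement model.

import numpy as np

def fuse_capacitive_lidar(c_meas, sigma_c, d_lidar, sigma_lidar, k=1.0):
    d_cap = k / c_meas                  # assumed inverse capacitance-to-distance model
    J = -k / c_meas**2                  # first-order Taylor derivative d(d)/d(c)
    sigma_cap = abs(J) * sigma_c        # linearized standard deviation in distance
    w_cap, w_lid = 1.0 / sigma_cap**2, 1.0 / sigma_lidar**2
    d_fused = (w_cap * d_cap + w_lid * d_lidar) / (w_cap + w_lid)
    return d_fused, np.sqrt(1.0 / (w_cap + w_lid))

# Illustrative call with assumed readings.
d, sigma = fuse_capacitive_lidar(c_meas=4.0, sigma_c=0.05, d_lidar=0.27, sigma_lidar=0.02)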
|
|
WeET5 |
305 |
Aerial Robots 3 |
Regular Session |
Chair: Schoellig, Angela P. | TU Munich |
Co-Chair: Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
|
16:35-16:40, Paper WeET5.1 | |
Robust Attitude Control with Fixed Exponential Rate of Convergence and Consideration of Motor Dynamics for Tilt Quadrotor Using Quaternions (I) |
|
Seshasayanan, Sathyanarayanan | Indian Institute of Technology Kanpur |
De, Souradip | Assistant Professor, Mnnit Allahabad |
Sahoo, Soumya Ranjan | Indian Institute of Technology Kanpur |
Keywords: Aerial Systems: Mechanics and Control, Robust/Adaptive Control
Abstract: In the existing literature on the robust control design of UAV systems, the controllers are designed without considering motor dynamics. Hence, if these controller gains are not correctly tuned, the system undergoes oscillation and may even go unstable. We have demonstrated this through an experiment in this work. Here, we propose a novel control strategy that considers actuator parameter uncertainties, including motor dynamics for a tilt quadrotor. This strategy is based on the traditional two-loop control scheme where the inner loop controls the angular velocity, and the outer loop controls the vehicle’s attitude based on quaternions. In the quaternion-based controller, usually, the convergence rate increases when the quaternion starts closer to its equilibrium point, thus making it challenging to design a linear controller for the inner loop. To overcome this, we propose a nonlinear control with a varying gain for the outer loop that ensures the quaternion has a fixed convergence rate. We propose the control design of the inner loop, which consists of a disturbance observer (DOB) and a linear controller. The DOB is optimally designed to minimize external disturbances in the presence of model uncertainties. With the DOB, a linear controller is designed for the inner loop, guaranteeing robust stability and performance against the model and actuator parameter uncertainties. The results of experimental flights are reported in this paper.
|
|
16:40-16:45, Paper WeET5.2 | |
Flying through Moving Gates without Full State Estimation |
|
Römer, Ralf | Technical University of Munich |
Emmert, Tim | TU Munich |
Schoellig, Angela P. | TU Munich |
Keywords: Aerial Systems: Mechanics and Control, Vision-Based Navigation
Abstract: Autonomous drone racing requires powerful perception, planning, and control and has become a benchmark and test field for autonomous, agile flight. Existing work usually assumes static race tracks with known maps, which enables offline planning of time-optimal trajectories, performing localization to the gates to reduce the drift in visual-inertial odometry (VIO) for state estimation or training learning-based methods for the particular race track and operating environment. In contrast, many real-world tasks like disaster response or delivery need to be performed in unknown and dynamic environments. To make drone racing more robust against unseen environments and moving gates, we propose a control algorithm that operates without a race track map or VIO, relying solely on monocular measurements of the line of sight to the gates. For this purpose, we adopt the law of proportional navigation (PN) to accurately fly through the gates despite gate motions or wind. We formulate the PN-informed vision-based control problem for drone racing as a constrained optimization problem and derive a closed-form optimal solution. Through simulations and real-world experiments, we demonstrate that our algorithm can navigate through moving gates at high speeds while being robust to different gate movements, model errors, wind, and delays.
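For reference, the classical proportional navigation law adopted in the abstract can be sketched in a few lines: the commanded lateral acceleration is proportional to the line-of-sight rotation rate. The finite-difference rate estimate and all numbers below are illustrative assumptions, not the authors' closed-form optimal controller.

def los_rate(bearing_prev, bearing_now, dt):
    # Finite-difference estimate of the line-of-sight angular rate from two
    # successive monocular bearing measurements to the gate (rad/s).
    return (bearing_now - bearing_prev) / dt

def pn_lateral_accel(N, closing_speed, lam_dot):
    # Planar proportional navigation: a_cmd = N * V_c * lambda_dot.
    return N * closing_speed * lam_dot

lam_dot = los_rate(bearing_prev=0.10, bearing_now=0.12, dt=0.05)
a_cmd = pn_lateral_accel(N=3.0, closing_speed=6.0, lam_dot=lam_dot)  # m/s^2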
|
|
16:45-16:50, Paper WeET5.3 | |
Collapsible Airfoil Single Actuator ROtor-Craft (CASARO) - Construction and Analysis of a Soft Rotary Wing Robot |
|
Ang, Wei Jun | Singapore University of Technology & Design |
Tang, Emmanuel | Singapore University of Technology & Design |
Ng, Matthew | Singapore University of Technology and Design |
Foong, Shaohui | Singapore University of Technology and Design |
Keywords: Aerial Systems: Applications, Biologically-Inspired Robots, Soft Robot Materials and Design
Abstract: In this paper, a soft rotary wing robot capable of flight and control is presented. The Collapsible Airfoil Single Actuator ROtor-craft (CASARO) is a single actuator monocopter that derives its geometric properties from the Samara seed. CASARO achieves better flight efficiency, lift, and handling ergonomics by reducing its overall volume by 91.7% when collapsed and stowed. Unlike conventional rotorcraft, CASARO uses a non-rigid fabric wing to produce lift in flight. It utilizes the robot’s rotational velocity to maintain tension within its fabric and airframe, providing adequate lift during its hover state. The conception, design, construction, and control of the soft monowing are demonstrated, including its capability to reduce its footprint with its soft fabric construction. To analyze the flight dynamics of CASARO, the craft is flown indoors autonomously, tracking its wing surface, craft body attitude, and position with various step inputs to observe different wing dynamics. CASARO is also capable of being deployed outdoors for real-life human-operated flight.
|
|
16:50-16:55, Paper WeET5.4 | |
VizFlyt: Perception-Centric Pedagogical Framework for Autonomous Aerial Robots |
|
Srivastava, Kushagra | Worcester Polytechnic Institute |
Kulkarni, Rutwik Sudhakar | Worcester Polytechnic Institute |
Velmurugan, Manoj | Worcester Polytechnic Institute |
Jagannatha Sanket, Nitin | Worcester Polytechnic Institute |
Keywords: Aerial Systems: Perception and Autonomy, Education Robotics, Aerial Systems: Applications
Abstract: Autonomous aerial robots are becoming commonplace in our lives. Hands-on aerial robotics courses are pivotal in training the next-generation workforce to meet the growing market demands. Such an efficient and compelling course depends on a reliable testbed. In this paper, we present VizFlyt, an open-source perception-centric Hardware-In-The-Loop (HITL) photorealistic testing framework for aerial robotics courses. We utilize pose from an external localization system to hallucinate real-time and photorealistic visual sensors using 3D Gaussian Splatting. This enables stress-free testing of autonomy algorithms on aerial robots without the risk of crashing into obstacles. We achieve over 100Hz of system update rate. Lastly, we build upon our past experiences of offering hands-on aerial robotics courses and propose a new open-source and open-hardware curriculum based on VizFlyt for the future. We test our framework on various course projects in real-world HITL experiments and present the results showing the efficacy of such a system and its large potential use cases. Code, datasets, hardware guides and demo videos are available at https://pear.wpi.edu/research/vizflyt.html
|
|
16:55-17:00, Paper WeET5.5 | |
Distributed Loitering Synchronization with Fixed-Wing UAVs |
|
AlKatheeri, Ahmed | NA |
Barcis, Agata | Technology Innovation Institute |
Ferrante, Eliseo | Vrije Universiteit Amsterdam |
Keywords: Distributed Robot Systems, Swarm Robotics, Multi-Robot Systems
Abstract: Distributed loitering synchronization is the process whereby a group of fixed-wing Unmanned Aerial Vehicles (UAVs) align with each other while they follow a circular path in the air. This process is essential to establish proper initial conditions for missions in the real world. We evaluate the performance of three synchronization algorithms using a setup of continuously moving fixed-wing drones randomly placed around a loitering circle. We consider the algorithm based on distributed consensus as a baseline. We propose two methods: the Minimum Of Shortest Arc (MOSA) algorithm that outperforms the baseline in this setup and Firefly multi-Pulse Synchronization (FPS), which is inspired by firefly synchronization. The latter method requires 10 times less communication while maintaining a performance comparable to the baseline. These algorithms were first tested in a simple simulation, then a more realistic simulation environment using Gazebo in which fixed-wing dynamics are considered. The proposed algorithms are rigorously tested in simulation through multiple trials involving a group of 10 UAVs, confirming the effectiveness of our approaches. The results were then validated in real flights using 3 fixed-wing drones. Index Terms— Fixed-Wing UAVs, Distributed Synchronization, Multi-Robot Systems, Pulse-Coupled Oscillators
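As background on the firefly-inspired mechanism mentioned above, here is a toy pulse-coupled oscillator simulation in the classic Mirollo-Strogatz spirit: each agent's phase grows linearly, and whenever one fires the others jump their phases toward firing. This is only an illustration of the underlying principle, not the FPS or MOSA algorithm.

import numpy as np

def simulate_pulse_coupled(phases, coupling=0.05, steps=2000, dt=0.01):
    phases = np.array(phases, dtype=float)
    for _ in range(steps):
        phases += dt                          # free-running phase advance
        fired = phases >= 1.0
        if fired.any():
            phases[fired] = 0.0               # firing agents reset
            phases[~fired] = np.minimum(phases[~fired] + coupling, 1.0)  # others jump
    return phases

final_phases = simulate_pulse_coupled(np.random.default_rng(0).random(10))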
|
|
17:00-17:05, Paper WeET5.6 | |
A Map-Free Deep Learning-Based Framework for Gate-To-Gate Monocular Visual Navigation Aboard Miniaturized Aerial Vehicles |
|
Scarciglia, Lorenzo | SUPSI, IDSIA |
Paolillo, Antonio | IDSIA USI-SUPSI |
Palossi, Daniele | ETH Zurich |
Keywords: Aerial Systems: Applications, Micro/Nano Robots, Deep Learning for Visual Perception
Abstract: Palm-sized autonomous nano-drones, i.e., sub-50 g in weight, recently entered the drone racing scenario, where they are tasked to avoid obstacles and navigate as fast as possible through gates. However, in contrast with their bigger counterparts, i.e., kg-scale drones, nano-drones expose three orders of magnitude less onboard memory and compute power, demanding more efficient and lightweight vision-based pipelines to win the race. This work presents a map-free vision-based (using only a monocular camera) autonomous nano-drone that combines a real-time deep learning gate detection front-end with a classic yet elegant and effective visual servoing control back-end, relying only on onboard resources. Starting from two state-of-the-art tiny deep learning models, we adapt them for our specific task, and after a mixed simulator-real-world training, we integrate and deploy them aboard our nano-drone. Our best-performing pipeline costs only 24 M multiply-accumulate operations per frame, resulting in a closed-loop control performance of 30 Hz, while achieving a gate detection root mean square error of 1.4 pixels on our ∼20 k real-world image dataset. In-field experiments highlight the capability of our nano-drone to successfully navigate through 15 gates in 4 min, never crashing and covering a total travel distance of ∼100 m, with a peak flight speed of 1.9 m/s. Finally, to stress the generalization capability of our system, we also test it in a never-seen-before environment, where it navigates through gates for more than 4 min.
|
|
17:05-17:10, Paper WeET5.7 | |
Agile Fixed-Wing UAVs for Urban Swarm Operations (I) |
|
Basescu, Max | Johns Hopkins University Applied Physics Lab |
Polevoy, Adam | Johns Hopkins University Applied Physics Lab |
Yeh, Bryanna | The Johns Hopkins University Applied Physics Laboratory |
Scheuer, Luca | Johns Hopkins University Applied Physics Lab |
Sutton, Erin | Johns Hopkins University Applied Physics Laboratory |
Moore, Joseph | Johns Hopkins University |
Keywords: Aerial Systems: Perception and Autonomy, Aerial Systems: Mechanics and Control, Aerial Systems: Applications
Abstract: Fixed-wing uncrewed aerial vehicles (UAVs) offer significant performance advantages over rotary-wing UAVs in terms of speed, endurance, and efficiency. Such attributes make these vehicles ideally suited for long-range or high-speed reconnaissance operations and position them as valuable complementary members of a heterogeneous multi-robot team. However, these vehicles have traditionally been severely limited with regard to both vertical take-off and landing (VTOL) and maneuverability, which greatly restricts their utility in environments characterized by complex obstacle fields (e.g., forests or urban centers). This paper describes a set of algorithms and hardware advancements that enable agile fixed-wing UAVs to operate as members of a swarm in complex urban environments. At the core of our approach is a direct nonlinear model predictive control (NMPC) algorithm that is capable of controlling fixed-wing UAVs through aggressive post-stall maneuvers. We demonstrate in hardware how our online planning and control technique can enable navigation through tight corridors and in close proximity to obstacles. We also demonstrate how our approach can be combined with onboard stereo vision to enable high-speed flight in unknown environments. Finally, we describe our method for achieving swarm system integration; this includes a gimballed propeller design to facilitate automatic take-off, a precision deep-stall landing capability, and multi-vehicle collision avoidance.
|
|
WeET6 |
307 |
Learning for Legged Locomotion 1 |
Regular Session |
Chair: Atanasov, Nikolay | University of California, San Diego |
Co-Chair: Wang, Xiaolong | UC San Diego |
|
16:35-16:40, Paper WeET6.1 | |
Offline Adaptation of Quadrupeds Using Diffusion Models |
|
O'Mahoney, Reece | University of Oxford |
Mitchell, Alexander Luis | University of Oxford |
Yu, Wanming | University of Oxford |
Posner, Ingmar | Oxford University |
Havoutis, Ioannis | University of Oxford |
Keywords: Legged Robots, Imitation Learning, Machine Learning for Robot Control
Abstract: We present a diffusion-based approach to quadrupedal locomotion that simultaneously addresses the limitations of learning and interpolating between multiple skills (modes) and of offline adapting to new locomotion behaviours after training. This is the first framework to apply classifier-guided diffusion to quadruped locomotion and demonstrate its efficacy by extracting goal-conditioned behaviour from an originally unlabelled dataset. We show that these capabilities are compatible with a multi-skill policy and can be applied with little modification. We verify the validity of our approach with hardware experiments on the ANYmal quadruped platform.
|
|
16:40-16:45, Paper WeET6.2 | |
High-Performance Reinforcement Learning on Spot: Optimizing Simulation Parameters with Distributional Measures |
|
Miller, A.J. | Massachusetts Institute of Technology |
Yu, Fangzhou | Robotics and AI Institute |
Brauckmann, Michael | AI Institute |
Farshidian, Farbod | Robotics and AI Institute |
Keywords: Reinforcement Learning, Legged Robots, Deep Learning Methods
Abstract: This work presents an overview of the technical details behind a high-performance reinforcement learning policy deployment with the Spot RL Researcher Development Kit for low-level motor access on Boston Dynamics' Spot. This represents the first public demonstration of an end-to-end reinforcement learning policy deployed on Spot hardware, with training code publicly available through Nvidia IsaacLab and deployment code available through Boston Dynamics. We utilize the Wasserstein Distance and Maximum Mean Discrepancy to quantify the distributional dissimilarity of data collected on hardware and in simulation to measure our sim-to-real gap. We use these measures as a scoring function for the Covariance Matrix Adaptation Evolution Strategy to optimize simulated parameters that are unknown or difficult to measure from Spot. Our procedure for modeling and training produces high-quality reinforcement learning policies capable of multiple gaits, including a flight phase. We deploy policies capable of over 5.2 m/s locomotion, more than triple Spot's default controller maximum speed, robustness to slippery surfaces, disturbance rejection, and overall agility previously unseen on Spot. We detail our method and release our code to support future work on Spot with the low-level API.
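The sim-to-real score mentioned in the abstract can be illustrated with a small, self-contained example. The snippet below is only a sketch of the general idea, an RBF-kernel Maximum Mean Discrepancy between hardware and simulation state samples with toy data and an arbitrary bandwidth, and is not the released training or deployment code; the Wasserstein term and the CMA-ES loop that tunes simulation parameters against this score are omitted:

```python
import numpy as np

def rbf_mmd2(x, y, sigma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel.

    x: (n, d) array of states logged on hardware
    y: (m, d) array of states logged in simulation
    """
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-d2 / (2 * sigma**2))
    n, m = len(x), len(y)
    return k(x, x).sum() / n**2 + k(y, y).sum() / m**2 - 2 * k(x, y).sum() / (n * m)

# Toy usage: joint-velocity snapshots from hardware vs. two candidate simulators.
rng = np.random.default_rng(1)
hardware = rng.normal(0.0, 1.0, size=(500, 12))
sim_good = rng.normal(0.05, 1.0, size=(500, 12))
sim_bad = rng.normal(0.8, 1.5, size=(500, 12))
print(rbf_mmd2(hardware, sim_good))   # small gap
print(rbf_mmd2(hardware, sim_bad))    # larger gap -> worse simulation parameters
```

A black-box optimizer such as CMA-ES would then search the simulator's unknown parameters to minimize this kind of score.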
|
|
16:45-16:50, Paper WeET6.3 | |
HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots |
|
He, Tairan | Carnegie Mellon University |
Xiao, Wenli | Carnegie Mellon University |
Lin, Toru | University of California, Berkeley |
Luo, Zhengyi | Carnegie Mellon University |
Xu, Zhenjia | Columbia University |
Jiang, Zhenyu | The Unversity of Texas at Austin |
Kautz, Jan | NVIDIA |
Liu, Changliu | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Wang, Xiaolong | UC San Diego |
Fan, Linxi | Stanford University |
Zhu, Yuke | The University of Texas at Austin |
Keywords: Reinforcement Learning, Legged Robots, Whole-Body Motion Planning and Control
Abstract: Humanoid whole-body control requires adapting to diverse tasks such as navigation, loco-manipulation, and tabletop manipulation, each demanding a different mode of control. For example, navigation relies on root velocity or position tracking, while tabletop manipulation prioritizes upper-body joint angle tracking. Existing approaches typically train individual policies tailored to a specific command space, limiting their transferability across modes. We present the key insight that full-body kinematic motion imitation can serve as a common abstraction for all these tasks and provide general-purpose motor skills for learning multiple modes of whole-body control. Building on this, we propose HOVER (Humanoid Versatile Controller), a multi-mode policy distillation framework that consolidates diverse control modes into a unified policy. HOVER enables seamless transitions between control modes while preserving the distinct advantages of each, offering a robust and scalable solution for humanoid control across a wide range of modes. By eliminating the need for policy retraining for each control mode, our approach improves efficiency and flexibility for future humanoid applications.
|
|
16:50-16:55, Paper WeET6.4 | |
Learning Humanoid Locomotion with Perceptive Internal Model |
|
Long, Junfeng | Shanghai AI Laboratory |
Ren, Junli | Hong Kong University |
Shi, Moji | Delft University of Technology |
Wong, Ziseoi | Zhejiang University |
Huang, Tao | The Chinese University of Hong Kong |
Luo, Ping | The University of Hong Kong |
Pang, Jiangmiao | Shanghai AI Laboratory |
Keywords: Humanoid and Bipedal Locomotion, Humanoid Robot Systems, Reinforcement Learning
Abstract: In contrast to quadruped robots that can navigate diverse terrains using a "blind" policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational costs and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across various humanoid robots, various indoor and outdoor terrains, stairs, and various sensor configurations. Our method can enable a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.
|
|
16:55-17:00, Paper WeET6.5 | |
A Learning Framework for Diverse Legged Robot Locomotion Using Barrier-Based Style Rewards |
|
Kim, Gijeong | Korea Advanced Institute of Science and Technology, KAIST |
Lee, Yonghoon | Korea Advanced Institute of Science and Technology, KAIST |
Park, Hae-Won | Korea Advanced Institute of Science and Technology |
Keywords: Legged Robots, Humanoid and Bipedal Locomotion, Reinforcement Learning
Abstract: This work introduces a model-free reinforcement learning framework that enables various modes of motion (quadruped, tripod, or biped) and diverse tasks for legged robot locomotion. We employ a motion-style reward based on a relaxed logarithmic barrier function as a soft constraint, to bias the learning process toward the desired motion style, such as gait, foot clearance, joint position, or body height. The predefined gait cycle is encoded in a flexible manner, facilitating gait adjustments throughout the learning process. Extensive experiments demonstrate that KAIST HOUND, a 45 kg robotic system, can achieve biped, tripod, and quadruped locomotion using the proposed framework; quadrupedal capabilities include traversing uneven terrain, galloping at 4.67 m/s, and overcoming obstacles up to 58 cm (67 cm for HOUND2); bipedal capabilities include running at 3.6 m/s, carrying a 7.5 kg object, and ascending stairs, all performed without exteroceptive input.
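The style reward is built from a relaxed logarithmic barrier. As a minimal sketch of that construction only, where the band limits, weight, and relaxation parameter are made-up values rather than the paper's settings, one soft-constraint reward term could look like:

```python
import numpy as np

# Sketch of a relaxed logarithmic barrier used as a soft-constraint "style"
# reward (illustrative only; numbers are placeholders).
def relaxed_log_barrier(z, delta=0.02):
    """-log(z) for z >= delta, quadratically extended below delta so the
    penalty stays finite and differentiable even when the constraint z > 0
    is violated."""
    z = np.asarray(z, dtype=float)
    quad = 0.5 * (((z - 2.0 * delta) / delta) ** 2 - 1.0) - np.log(delta)
    return np.where(z >= delta, -np.log(np.maximum(z, 1e-12)), quad)

def style_reward(value, lower, upper, weight=0.1, delta=0.02):
    """Reward that gently biases `value` to stay inside [lower, upper]."""
    return -weight * (relaxed_log_barrier(value - lower, delta)
                      + relaxed_log_barrier(upper - value, delta))

# Example: keep the body height near a desired band of 0.45 m to 0.55 m.
for h in (0.50, 0.46, 0.42):
    print(f"height {h:.2f} m -> style reward {float(style_reward(h, 0.45, 0.55)):+.3f}")
```

The quadratic extension is what makes this a soft constraint: violations are penalized increasingly strongly but never produce an infinite or undefined reward, which keeps the RL objective well behaved.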
|
|
17:00-17:05, Paper WeET6.6 | |
Full-Order Sampling-Based MPC for Torque-Level Locomotion Control Via Diffusion-Style Annealing |
|
Xue, Haoru | University of California Berkeley |
Pan, Chaoyi | Carnegie Mellon University |
Yi, Zeji | Carnegie Mellon University |
Qu, Guannan | Carnegie Mellon University |
Shi, Guanya | Carnegie Mellon University |
Keywords: Legged Robots, Optimization and Optimal Control, Machine Learning for Robot Control
Abstract: Due to high dimensionality and non-convexity, real-time optimal control using full-order dynamics models for legged robots is challenging. Therefore, Nonlinear Model Predictive Control (NMPC) approaches are often limited to reduced-order models. Sampling-based MPC has shown potential in nonconvex and even discontinuous problems, but often yields suboptimal solutions with high variance, which limits its application to high-dimensional locomotion. This work introduces DIAL-MPC (Diffusion-Inspired Annealing for Legged MPC), a sampling-based MPC framework with a novel diffusion-style annealing process. Such an annealing process is supported by the theoretical landscape analysis of Model Predictive Path Integral Control (MPPI) and the connection between MPPI and single-step diffusion. Algorithmically, DIAL-MPC iteratively refines solutions online and achieves both global coverage and local convergence. In quadrupedal torque-level control tasks, DIAL-MPC reduces the tracking error of standard MPPI by 13.4 times and outperforms reinforcement learning (RL) policies by 50% in challenging climbing tasks without any training. In particular, DIAL-MPC enables precise real-world quadrupedal jumping with payload. To the best of our knowledge, DIAL-MPC is the first training-free method that optimizes over full-order quadruped dynamics in real-time.
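The core mechanism, an MPPI-style update whose sampling noise is annealed across iterations, can be sketched on a toy point-mass task. This is only an illustration of the annealing idea, not DIAL-MPC itself; the dynamics, cost, and schedule below are placeholders:

```python
import numpy as np

# Sampling-based MPC with a diffusion-style annealing schedule: the
# exploration noise shrinks over successive MPPI iterations, trading
# global coverage for local refinement (illustrative sketch only).
rng = np.random.default_rng(0)
horizon, n_samples, n_iters = 20, 256, 4
dt, temperature = 0.05, 1.0
target = np.array([1.0, 0.0])          # reach x = 1 m with zero velocity

def rollout_cost(u_seq):
    x = np.zeros(2)                     # [position, velocity]
    cost = 0.0
    for u in u_seq:
        x = x + dt * np.array([x[1], u])   # toy point-mass dynamics
        cost += 10.0 * (x[0] - target[0])**2 + 0.1 * x[1]**2 + 1e-3 * u**2
    return cost

u_mean = np.zeros(horizon)
for it in range(n_iters):
    sigma = 2.0 * (0.3 ** it)           # annealed noise scale (diffusion-style)
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon))
    candidates = u_mean[None, :] + noise
    costs = np.array([rollout_cost(u) for u in candidates])
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    u_mean = weights @ candidates       # MPPI-style weighted update
    print(f"iter {it}: sigma={sigma:.2f}, best cost={costs.min():.3f}")
```

Early iterations with large sigma explore broadly (global coverage); later iterations with small sigma refine the current best control sequence (local convergence), mirroring the coarse-to-fine behaviour of diffusion sampling.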
|
|
17:05-17:10, Paper WeET6.7 | |
WildLMa: Long Horizon Loco-Manipulation in the Wild |
|
Qiu, Ri-Zhao | University of California, San Diego |
Song, Yuchen | UC San Diego |
Peng, Xuanbin | University of California, San Diego |
Suryadevara, Sai Aneesh | University of California San Diego |
Yang, Ge | Massachusetts Institute of Technology |
Liu, Minghuan | Shanghai Jiao Tong University |
Ji, Mazeyu | UCSD |
Jia, Chengzhe | University of California San Diego
Yang, Ruihan | UC San Diego |
Zou, Xueyan | University of California, San Diego
Wang, Xiaolong | UC San Diego |
Keywords: Imitation Learning, Mobile Manipulation, Legged Robots
Abstract: 'In-the-wild' mobile manipulation aims at deploying robots in diverse real-world environments, which requires the robot to (1) have skills that generalize across object configurations; (2) be capable of long-horizon task execution in diverse environments; and (3) perform complex manipulation beyond pick-and-place. Quadruped robots with manipulators hold promise for such capabilities thanks to their extended workspace and robust locomotion, but existing results do not investigate them. This paper proposes WildLMa, with three components to address these issues: (1) a learned low-level controller for VR-enabled whole-body tele-operation and traversability; (2) WildLMa-Skill, a library of generalizable visuomotor skills acquired via imitation learning or an analytical planner; and (3) WildLMa-Planner, an LLM planner that interfaces with and coordinates these skills. WildLMa exploits CLIP for language-conditioned imitation learning that empirically generalizes to objects unseen in training demonstrations. We then show that these skills can be effectively interfaced with an LLM planner for autonomous long-horizon execution. Besides extensive quantitative evaluation, we qualitatively demonstrate practical robot applications, such as cleaning up trash in university hallways or outdoor terrains, operating articulated objects, and rearranging items on a bookshelf.
|
|
17:10-17:15, Paper WeET6.8 | |
Variable-Frequency Model Learning and Predictive Control for Jumping Maneuvers on Legged Robots |
|
Nguyen, Chuong | University of Southern California |
Altawaitan, Abdullah | University of California San Diego |
Duong, Thai | Rice University |
Atanasov, Nikolay | University of California, San Diego |
Nguyen, Quan | University of Southern California |
Keywords: Legged Robots, Model Learning for Control
Abstract: Achieving both target accuracy and robustness in dynamic maneuvers with long flight phases, such as high or long jumps, has been a significant challenge for legged robots. To address this challenge, we propose a novel learning-based control approach consisting of model learning and model predictive control (MPC) utilizing a variable-frequency scheme. Compared to existing MPC techniques, we learn a model directly from experiments, accounting not only for leg dynamics but also for modeling errors and unknown dynamics mismatch in hardware and during contact. Additionally, learning the model with variable frequency allows us to cover the entire flight phase and final jumping target, enhancing the prediction accuracy of the jumping trajectory. Using the learned model, we also design a variable-frequency MPC to effectively leverage different jumping phases and track the target accurately. In a total of 92 jumps on Unitree A1 robot hardware, we verify that our approach outperforms other MPCs using a fixed-frequency scheme or a nominal model, reducing the jumping distance error by 2 to 8 times. We also achieve jumping distance errors of less than 3 percent during continuous jumping on uneven terrain with randomly placed perturbations of random heights (up to 4 cm, or 27 percent of the robot's standing height). Our approach obtains distance errors of 1 cm to 2 cm on 34 single and continuous jumps with different jumping targets and model uncertainties. Code is available at https://github.com/DRCL-USC/Learning_MPC_Jumping.
|
|
WeET7 |
309 |
Perception 3 |
Regular Session |
Co-Chair: Araujo, Helder | University of Coimbra |
|
16:35-16:40, Paper WeET7.1 | |
Drive with the Flow |
|
Mannocci, Enrico | University of Bologna |
Poggi, Matteo | University of Bologna |
Mattoccia, Stefano | University of Bologna |
Keywords: Computer Vision for Transportation, RGB-D Perception, Imitation Learning
Abstract: End-to-end autonomous driving systems have recently made rapid progress, thanks to simulators such as CARLA. They can drive without infractions of common driving rules on uncongested roads but still struggle with dense traffic scenarios. We conjecture that this occurs because such systems lack an understanding of the dynamics of the surrounding vehicles, owing to the absence of explicit short-term memory within the perception path of end-to-end models. To address this challenge, we revise the perception module to explicitly model temporal information by extending it with an auxiliary task that is well known in computer vision research: optical flow. We generate a novel benchmark using the CARLA simulator to train our model, FlowFuser, and prove its superior ability to avoid collisions with other agents on the road.
|
|
16:40-16:45, Paper WeET7.2 | |
Potential Fields As Scene Affordance for Behavior Change-Based Visual Risk Object Identification |
|
Pao, Pang-Yuan | National Yang Ming Chiao Tung University |
Lu, Shu-Wei | National Yang Ming Chiao Tung University |
Lu, Zeyan | National Yang Ming Chiao Tung University |
Chen, Yi-Ting | National Yang Ming Chiao Tung University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Visual Learning
Abstract: We study behavior change-based visual risk object identification (Visual-ROI), a crucial formulation that aims to detect potential hazards for intelligent driving systems. Existing methods often show significant limitations in spatial accuracy and temporal consistency, stemming from an incomplete understanding of scene affordance. For example, these methods frequently misidentify vehicles that do not impact the ego vehicle as risk objects. Furthermore, existing behavior change-based methods are inefficient because they implement causal inference in the perspective image space. We propose a new framework with a Bird's Eye View (BEV) representation to overcome the above challenges. Specifically, we utilize potential fields as scene affordance, involving repulsive forces derived from road infrastructure and traffic participants, along with attractive forces sourced from target destinations. In this work, we compute potential fields from perspective images by assigning different energy levels based on the semantic labels acquired through BEV semantic segmentation. We conduct comprehensive experiments and ablation studies, comparing the proposed method with various state-of-the-art algorithms on both synthetic and real-world datasets. Our results show a notable increase in spatial accuracy and temporal consistency, with enhancements of 20.3% and 11.6% on the RiskBench dataset, respectively. Additionally, we improve computational efficiency by 88%. Similarly, on the nuScenes dataset, we achieve improvements of 5.4% and 7.2% in spatial and temporal consistency.
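The scene-affordance potential field can be illustrated with a small BEV example. This is an assumption-laden sketch of the general idea only, using a uniform grid, Gaussian repulsive bumps, and a quadratic attractive term with invented energy levels, and does not reproduce the paper's semantic-segmentation-driven construction:

```python
import numpy as np

# Toy BEV potential field: repulsive terms from occupied cells plus an
# attractive term toward a goal cell. In the paper, different semantic
# classes would assign different energy levels; here two classes are made up.
H, W = 100, 100                                # 100 x 100 BEV grid
ys, xs = np.mgrid[0:H, 0:W]
goal = np.array([90, 50])                      # goal cell (row, col)

def repulsive(center, energy, sigma):
    d2 = (ys - center[0])**2 + (xs - center[1])**2
    return energy * np.exp(-d2 / (2 * sigma**2))

potential = 0.01 * ((ys - goal[0])**2 + (xs - goal[1])**2)   # attractive
potential += repulsive((55, 48), energy=50.0, sigma=4.0)     # e.g. a vehicle
potential += repulsive((70, 60), energy=30.0, sigma=6.0)     # e.g. road edge

# The ego vehicle would descend the field: pick the lowest-potential neighbour.
ego = np.array([10, 50])
neighbours = [ego + d for d in ([1, 0], [0, 1], [0, -1], [1, 1], [1, -1])]
best = min(neighbours, key=lambda c: potential[c[0], c[1]])
print("next BEV cell:", best)
```

Working in the BEV grid rather than the perspective image is also what allows the risk reasoning to be carried out with simple, cheap array operations.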
|
|
16:45-16:50, Paper WeET7.3 | |
SCAM-P: Spatial Channel Attention Module for Panoptic Driving Perception |
|
Erabati, Gopi Krishna | University of Coimbra |
Araujo, Helder | University of Coimbra |
Keywords: Intelligent Transportation Systems, Deep Learning for Visual Perception, Visual Learning
Abstract: A high-precision, high-efficiency, and lightweight panoptic driving perception system is an essential part of autonomous driving for optimal maneuver planning of the autonomous vehicle. We propose a simple, lightweight, and efficient SCAM-P multi-task learning network that accomplishes three crucial tasks simultaneously for panoptic driving: vehicle detection, drivable area segmentation, and lane segmentation. To increase the representation power of the shared backbone of our multi-task network, we designed a novel SCAM module with spatially localized channel attention and channel localized spatial attention blocks. SCAM is a lightweight module that can be plugged into any CNN architecture to enhance the semantic features with negligible computational overhead. We integrate our SCAM module and design the SCAM-P network, which has a shared backbone for feature extraction and three independent heads to handle the three tasks at the same time. We also designed a nano variant of our SCAM-P network to make it deployment-friendly on edge devices. Our SCAM-P network obtains competitive results on the BDD100K dataset with 81.1% mAP50 for object detection, 91.6% mIoU for drivable area segmentation, and 28.8% IoU for lane segmentation. Our model is robust in various adverse conditions, such as rain, snow, and nighttime. Our SCAM-P network not only achieves improved performance but also runs efficiently in real time at 230.5 FPS on the RTX 4090 GPU and 112.1 FPS on the Jetson Orin edge device.
|
|
16:50-16:55, Paper WeET7.4 | |
IROAM: Improving Roadside Monocular 3D Object Detection Learning from Autonomous Vehicle Data Domain |
|
Wang, Zhe | Institute for AI Industry Research, Tsinghua University |
Huo, Xiaoliang | Beihang University |
Fan, Siqi | Tsinghua University |
Wang, Yan | Tsinghua University |
Liu, Jingjing | Institute for AI Industry Research (AIR), Tsinghua University |
Zhang, Ya-Qin | Institute for AI Industry Research(AIR), Tsinghua University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Visual Learning
Abstract: In autonomous driving, the perception capabilities of the ego-vehicle can be improved with roadside sensors, which provide a holistic view of the environment. However, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras due to viewpoint domain gaps. To bridge this gap and Improve ROAdside Monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework, which takes vehicle-side and roadside data as input simultaneously. IROAM has two significant modules. The In-Domain Query Interaction module utilizes a transformer to learn content and depth information for each domain and outputs object queries. To learn better feature representations from the two domains, the Cross-Domain Query Enhancement module decouples queries into semantic and geometry parts, and only the former is used for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving the roadside detector's performance. The results validate that IROAM has the capability to learn cross-domain information.
|
|
16:55-17:00, Paper WeET7.5 | |
Fast LiDAR Data Generation with Rectified Flows |
|
Nakashima, Kazuto | Kyushu University |
Liu, Xiaowen | Kyushu University |
Miyawaki, Tomoya | Kyushu University |
Iwashita, Yumi | NASA / Caltech Jet Propulsion Laboratory |
Kurazume, Ryo | Kyushu University |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, Representation Learning
Abstract: Building LiDAR generative models holds promise as powerful data priors for restoration, scene manipulation, and scalable simulation in autonomous mobile robots. In recent years, approaches using diffusion models have emerged, significantly improving training stability and generation quality. Despite their success, diffusion models require numerous iterations of running neural networks to generate high-quality samples, making the increasing computational cost a potential barrier for robotics applications. To address this challenge, this paper presents R2Flow, a fast and high-fidelity generative model for LiDAR data. Our method is based on rectified flows that learn straight trajectories, simulating data generation with significantly fewer sampling steps compared to diffusion models. We also propose an efficient Transformer-based model architecture for processing the image representation of LiDAR range and reflectance measurements. Our experiments on unconditional LiDAR data generation using the KITTI-360 dataset demonstrate the effectiveness of our approach in terms of both efficiency and quality.
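The speed-up rectified flows offer comes from learning a velocity field with nearly straight sample trajectories, which can then be integrated with very few Euler steps. The sketch below illustrates only that sampling mechanics on a 1-D toy target, with a closed-form field standing in for a trained network; it is not R2Flow's Transformer or its LiDAR range/reflectance representation:

```python
import numpy as np

# Rectified-flow sampling sketch (illustrative only). The "learned" field is
# the exact straight-line field for a toy 1-D target N(mu, sigma^2).
mu, sigma = 2.0, 0.5

def velocity(x, t):
    # Straight trajectories x_t = (1-t)*x0 + t*(mu + sigma*x0) imply this field.
    return mu + (sigma - 1.0) * (x - t * mu) / (1.0 + t * (sigma - 1.0))

def sample(n, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)            # start from Gaussian noise at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt)  # Euler step along the flow
    return x

for steps in (1, 2, 8):
    s = sample(100_000, steps)
    print(f"{steps} step(s): mean={s.mean():.3f}, std={s.std():.3f}")
# Because the trajectories are straight, even a single step lands on the target.
```

In contrast, a standard diffusion model would need many more network evaluations to traverse its curved probability-flow trajectories, which is the computational cost the paper aims to remove for onboard robotics use.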
|
|
17:00-17:05, Paper WeET7.6 | |
AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving |
|
Sekkat, Ahmed Rida | IAV GmbH |
Mohan, Rohit | University of Freiburg |
Sawade, Oliver | IAV GmbH |
Matthes, Elmar | IAV GmbH |
Valada, Abhinav | University of Freiburg |
Keywords: Computer Vision for Transportation, Data Sets for Robotic Vision, Deep Learning for Visual Perception
Abstract: Unlike humans, who can effortlessly estimate the entirety of objects even when partially occluded, modern computer vision algorithms still find this aspect extremely challenging. Leveraging this amodal perception for autonomous driving remains largely untapped due to the lack of suitable datasets. The curation of these datasets is primarily hindered by significant annotation costs and mitigating annotator subjectivity in accurately labeling occluded regions. To address these limitations, we introduce AmodalSynthDrive, a synthetic multi-task multi-modal amodal perception dataset. The dataset provides multi-view camera images, 3D bounding boxes, LiDAR data, and odometry for 150 driving sequences with over 1M object annotations in diverse traffic, weather, and lighting conditions. AmodalSynthDrive supports multiple amodal scene understanding tasks including the introduced amodal depth estimation for enhanced spatial understanding. We evaluate several baselines for each of these tasks to illustrate the challenges and set up public benchmarking servers. The dataset is available at http://amodalsynthdrive.cs.uni-freiburg.de.
|
|
WeET8 |
311 |
Representation Learning 4 |
Regular Session |
Chair: Ben Amor, Heni | Arizona State University |
Co-Chair: Gan, Lu | Georgia Institute of Technology |
|
16:35-16:40, Paper WeET8.1 | |
FedEFM: Federated Endovascular Foundation Model with Unseen Data |
|
Do, Tuong | AIOZ |
Vu Huu, Nghia | AIOZ |
Jianu, Tudor | University of Liverpool |
Huang, Baoru | Imperial College London |
Vu, Minh Nhat | TU Wien, Austria |
Su, Jionglong | Xi'an Jiaotong-Liverpool University |
Tjiputra, Erman | AIOZ |
Tran, Quang | AIOZ |
Chiu, Te-Chuan | National Tsing Hua University |
Nguyen, Anh | University of Liverpool |
Keywords: Computer Vision for Medical Robotics, Deep Learning Methods
Abstract: In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar-domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover's Distance within a knowledge distillation framework. Once trained, our foundation model's weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.
|
|
16:40-16:45, Paper WeET8.2 | |
LamPro: Multi-Prototype Representation Learning for Enhanced Visual Pattern Recognition |
|
Qi, Ji | China Mobile (Suzhou) Software Technology Co., Ltd, China |
Sun, Wei | University of Science and Technology of China |
Huang, Qihe | University of Science and Technology of China |
Zhou, Zhengyang | University of Science and Technology of China |
Wang, Yang | University of Science and Technology of China |
Keywords: Recognition, Computer Vision for Automation, Visual Learning
Abstract: Visual pattern recognition plays an important role in robotics and automation, where recognition performance relies on representation learning. Existing representation learning often neglects two important issues: the diversity of intra-class representations, and under-exploited label utilization, especially negative feedback during the training process. Fortunately, prototype learning potentially raises label utilization and encourages intra-class diversity. In this paper, we investigate intra-class diversity and effective updates in prototype learning for enhanced visual pattern recognition. Specifically, we propose Label-aware multi-Prototype learning (LamPro), which incorporates label awareness into both prototype formation and prototype updates to improve representation quality. Firstly, we design a supervised contrastive learning scheme to achieve class-discriminative representations. Secondly, we randomly initialize multiple prototypes and update the nearest prototype upon the arrival of an instance, to preserve intra-class diversity. Thirdly, we propose a novel Label-guided Adaptive Updating. We separate the prototype updates from the representation optimization and exploit the label indexes to directly implement the prediction feedback. To correct the model optimization directions, we identify negative feedback and correct the prototype updates via label queries. Finally, we design a memory-based counter to alternately update these deviated prototypes. Experiments verify the effectiveness of our label-aware and joint multi-prototype updating strategies.
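The nearest-prototype update with label-guided correction can be sketched compactly. This is only an illustrative toy, not the LamPro code: the random embeddings, the EMA update rule, and the doubled corrective step on a wrong prediction are assumptions, and the paper's memory-based counter is omitted:

```python
import numpy as np

# Multi-prototype learning sketch: each class keeps several prototypes; an
# incoming embedding updates only the nearest prototype of its own class, and
# a wrong prediction triggers an extra label-guided corrective update.
rng = np.random.default_rng(0)
n_classes, n_protos, dim = 3, 4, 16
protos = rng.normal(size=(n_classes, n_protos, dim))
momentum = 0.9

def predict(z):
    d = np.linalg.norm(protos - z, axis=-1)        # (classes, protos)
    c, p = np.unravel_index(np.argmin(d), d.shape)
    return c, p

def update(z, label):
    pred_class, _ = predict(z)
    # Nearest prototype within the true class preserves intra-class diversity.
    j = int(np.argmin(np.linalg.norm(protos[label] - z, axis=-1)))
    protos[label, j] = momentum * protos[label, j] + (1 - momentum) * z
    if pred_class != label:
        # Negative feedback: pull the true-class prototype further toward z.
        protos[label, j] = momentum * protos[label, j] + (1 - momentum) * z
    return pred_class

# Toy stream: embeddings drawn around class-specific centres.
centres = rng.normal(scale=4.0, size=(n_classes, dim))
correct = 0
for _ in range(2000):
    y = int(rng.integers(n_classes))
    z = centres[y] + rng.normal(scale=0.5, size=dim)
    correct += int(update(z, y) == y)
print(f"streaming accuracy: {correct / 2000:.2f}")
```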
|
|
16:45-16:50, Paper WeET8.3 | |
SAS-Prompt: Large Language Models As Numerical Optimizers for Robot Self-Improvement |
|
Ben Amor, Heni | Arizona State University |
Graesser, Laura | Google |
Iscen, Atil | Google |
D'Ambrosio, David | Google |
Abeyruwan, Saminda Wishwajith | Google Inc |
Bewley, Alex | Google |
Zhou, Yifan | Arizona State University |
Kalirathinam, Kamalesh | Arizona State University |
Mishra, Swaroop | Google DeepMind |
Sanketi, Pannag | Google |
Keywords: Learning from Experience, Incremental Learning
Abstract: We demonstrate the ability of large language models (LLMs) to perform iterative self-improvement of robot policies. An important insight of this paper is that LLMs have a built-in ability to perform (stochastic) numerical optimization and that this property can be leveraged for explainable robot policy search. Based on this insight, we introduce the SAS Prompt (Summarize, Analyze, Synthesize) – a single prompt that enables iterative learning and adaptation of robot behavior by combining the LLM’s ability to retrieve, reason and optimize over previous robot traces in order to synthesize new, unseen behavior. Our approach can be regarded as an early example of a new family of explainable policy search methods that are entirely implemented within an LLM. We evaluate our approach both in simulation and on a real-robot table tennis task. Project website: sites.google.com/asu.edu/sas-llm/
|
|
16:50-16:55, Paper WeET8.4 | |
Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-Based Autonomous Driving |
|
Xie, Yichen | University of California, Berkeley |
Chen, Hongge | Waymo |
Meyer, Gregory P. | Motional |
Lee, Yong Jae | UW-Madison |
Wolff, Eric | Cruise |
Tomizuka, Masayoshi | University of California |
Zhan, Wei | Univeristy of California, Berkeley |
Chai, Yuning | Waymo |
Huang, Xin | MIT |
Keywords: Computer Vision for Automation, Motion and Path Planning, Representation Learning
Abstract: Multi-frame temporal inputs are important for vision-based autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D images as long as we can identify the same instance from different input frames. However, the dynamic nature of driving scenes leads to significant variance in the instance appearance and shape captured by the cameras at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations robust to the changes of distance and perspective in a long-term temporal sequence without any human annotations. In the pretraining stage, raw point clouds from LiDAR sensors are utilized to construct the instance-wise long-term temporal correspondence, which serves as guidance for the extraction of instance-level representation from the vision-based bird's-eye-view (BEV) feature map. Cohere3D encourages consistent representation for the same instance at different frames but distinguishes between different instances. We validate the effectiveness and generalizability of our algorithm by finetuning the pretrained model across key downstream autonomous driving tasks: perception, mapping, prediction, and planning. Results show a notable improvement in both data efficiency and final performance in all these tasks.
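The instance-level objective resembles a contrastive loss whose positives are the same instance observed at different times. The snippet below is a generic numpy sketch of such a temporal InfoNCE term with made-up features; the paper's LiDAR-derived correspondences and BEV feature extraction are not reproduced:

```python
import numpy as np

def temporal_infonce(z_t, z_tp1, temperature=0.07):
    """Contrastive loss over instance features from two frames (sketch only).

    z_t, z_tp1: (n, d) L2-normalised features of the same n instances,
    matched across time (row i in both arrays is the same object). Row i of
    the other frame is the positive; all other rows are negatives.
    """
    logits = (z_t @ z_tp1.T) / temperature        # (n, n) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

# Toy usage with made-up features.
rng = np.random.default_rng(0)
base = rng.normal(size=(32, 64))
z_t = base / np.linalg.norm(base, axis=1, keepdims=True)
z_tp1 = base + 0.1 * rng.normal(size=base.shape)  # same instances, next frame
z_tp1 /= np.linalg.norm(z_tp1, axis=1, keepdims=True)
print(temporal_infonce(z_t, z_tp1))
```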
|
|
16:55-17:00, Paper WeET8.5 | |
Towards Open-Ended Robotic Exploration Using Vision-Inspired Similarity and Foundation Models |
|
Filntisis, Panagiotis Paraskevas | National Technical University of Athens |
Tsaprazlis, Efthymios | Athena Research and Innovation Center |
Oikonomou, Paris | National Technical University of Athens (NTUA) |
Mattioli, Francesco | AI2Life |
Santucci, Vieri Giuliano | Consiglio Nazionale Delle Ricerche |
Retsinas, George | National Technical University of Athens |
Maragos, Petros | National Technical University of Athens |
Keywords: Deep Learning for Visual Perception, Continual Learning, Incremental Learning
Abstract: In the domain of robotics, achieving Lifelong Open-ended Learning Autonomy (LOLA) represents a significant milestone, especially in contexts where autonomous agents must adapt to unforeseen environmental variations and evolving objectives. This paper introduces VISOR (Vision-Inspired Similarity for Open-ended Robotic exploration), a vision-based framework designed to assist robotic agents in autonomously exploring and learning from new environments and objects, whether through guided or random exploration, without reliance on predefined design considerations. In that direction, VISOR acts as a perception mediator, classifying everything a robot encounters in a scene as either known or unknown. It further identifies potential distractors (e.g., background elements), known categories, or objects specified through text seeds. By leveraging recent advancements in vision foundation models, VISOR operates in a training-free manner. It begins by segmenting a scene into its constituent entities, regardless of familiarity, and then extracts robust visual representations for each one. These representations are compared against an adaptive memory system that evolves over time; unknown objects are assigned unique IDs and added to this memory as new classes, enriching the robot's understanding of its environment. We argue that this evolving memory can facilitate guided exploration through prior knowledge, enhancing the efficiency of robotic exploration, and validate this by designing two exploration scenarios and running both simulated and real-world experiments.
|
|
17:00-17:05, Paper WeET8.6 | |
MI-HGNN: Morphology-Informed Heterogeneous Graph Neural Network for Legged Robot Contact Perception |
|
Butterfield, Daniel Chase | Georgia Institute of Technology
Garimella, Sandilya Sai | Georgia Institute of Technology |
Cheng, NaiJen | Georgia Institute of Technology |
Gan, Lu | Georgia Institute of Technology |
Keywords: Deep Learning Methods, Force and Tactile Sensing, Legged Robots
Abstract: We present a Morphology-Informed Heterogeneous Graph Neural Network (MI-HGNN) for learning-based contact perception. The architecture and connectivity of the MI-HGNN are constructed from the robot morphology, in which nodes and edges are robot joints and links, respectively. By incorporating the morphology-informed constraints into a neural network, we improve a learning-based approach using model-based knowledge. We apply the proposed MI-HGNN to two contact perception problems, and conduct extensive experiments using both real-world and simulated data collected using two quadruped robots. Our experiments demonstrate the superiority of our method in terms of effectiveness, generalization ability, model efficiency, and sample efficiency. Our MI-HGNN improved the performance of a state-of-the-art model that leverages robot morphological symmetry by 8.4% with only 0.21% of its parameters. Although MI-HGNN is applied to contact perception problems for legged robots in this work, it can be seamlessly applied to other types of multi-body dynamical systems and has the potential to improve other robot learning frameworks. Our code is made publicly available at https://github.com/lunarlab-gatech/Morphology-Informed-HGNN.
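The morphology-informed graph can be illustrated by deriving nodes and edges from a generic 12-DoF quadruped's kinematic tree. The joint naming and typing below are assumptions for illustration, not the MI-HGNN code or a specific robot description:

```python
# Build a heterogeneous graph from a quadruped's kinematic tree: joints are
# nodes, links are edges. Shared node/edge "types" let a heterogeneous GNN
# reuse the same weights across the four morphologically identical legs.
legs = ["FL", "FR", "RL", "RR"]
joints = ["hip_abduction", "hip_flexion", "knee"]

nodes = {"base": {"type": "base"}}
edges = []  # (parent, child, edge_type)
for leg in legs:
    parent = "base"
    for j in joints:
        name = f"{leg}_{j}"
        nodes[name] = {"type": j, "leg": leg}          # same type across legs
        edges.append((parent, name, f"link_to_{j}"))
        parent = name
    nodes[f"{leg}_foot"] = {"type": "foot", "leg": leg}
    edges.append((parent, f"{leg}_foot", "shank_link"))

print(len(nodes), "nodes,", len(edges), "edges")
# A heterogeneous message-passing layer would keep one weight matrix per
# node/edge type, so the same parameters process all four legs; this weight
# sharing is one way the morphological structure acts as an inductive bias.
```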
|
|
17:05-17:10, Paper WeET8.7 | |
Data-Driven Dynamics Modeling of Miniature Robotic Blimps Using Neural ODEs with Parameter Auto-Tuning |
|
Zhu, Yongjian | Peking University |
Cheng, Hao | Peking University |
Zhang, Feitian | Peking University |
Keywords: Dynamics, Calibration and Identification, Machine Learning for Robot Control
Abstract: Miniature robotic blimps, as one type of lighter-than-air aerial vehicles, have attracted increasing attention in the science and engineering community for their enhanced safety, extended endurance, and quieter operation compared to quadrotors. Accurately modeling the dynamics of these robotic blimps poses a significant challenge due to the complex aerodynamics stemming from their large lifting bodies. Traditional first-principle models have difficulty obtaining accurate aerodynamic parameters and often overlook high-order nonlinearities, thus coming to their limit in modeling the motion dynamics of miniature robotic blimps. To tackle this challenge, this letter proposes the Auto-tuning Blimp-oriented Neural Ordinary Differential Equation method (ABNODE), a data-driven approach that integrates first-principle and neural network modeling. Spiraling motion experiments of robotic blimps are conducted, comparing the ABNODE with first-principle and other data-driven benchmark models, the results of which demonstrate the effectiveness of the proposed method.
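The modeling idea, a first-principle term augmented by a neural residual inside an ODE integrator, can be sketched in a few lines. The snippet below is only an illustration with a toy planar model and untrained random residual weights; it is not ABNODE and omits the parameter auto-tuning:

```python
import numpy as np

# Hybrid ODE: dx/dt = physics(x, u) + neural_residual(x, u), integrated with
# RK4. The residual weights are random placeholders standing in for a network
# trained on recorded blimp trajectories.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(16, 4)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(2, 16)), np.zeros(2)

def residual(x, u):
    h = np.tanh(W1 @ np.concatenate([x, u]) + b1)
    return W2 @ h + b2

def physics(x, u, mass=0.3, drag=0.15):
    # Toy 1-D first-principle model: [velocity, thrust/mass - quadratic drag].
    return np.array([x[1], u[0] / mass - drag * x[1] * abs(x[1])])

def f(x, u):
    return physics(x, u) + residual(x, u)

def rk4_step(x, u, dt=0.02):
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

x = np.zeros(2)                       # [position, velocity]
for _ in range(200):
    x = rk4_step(x, u=np.array([0.05, 0.0]))
print("state after 4 s:", x)
```

In a Neural-ODE setting, gradients would be propagated through the integrator so the residual (and, in the paper, the first-principle parameters) can be fitted to logged flight data.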
|
|
WeET9 |
312 |
Motion Planning and Control |
Regular Session |
Chair: Geng, Junyi | Pennsylvania State University |
Co-Chair: Brock, Oliver | Technische Universität Berlin |
|
16:35-16:40, Paper WeET9.1 | |
Improving the Performance of Learned Controllers in Behavior Trees Using Value Function Estimates at Switching Boundaries |
|
Kartašev, Mart | KTH Royal Institute of Technology |
Ogren, Petter | Royal Institute of Technology (KTH) |
Keywords: Behavior-Based Systems, Control Architectures and Programming, Integrated Planning and Learning
Abstract: Behavior trees represent a modular way to create an overall controller from a set of sub-controllers solving different sub-problems. These sub-controllers can be created using various methods, such as classical model based control or reinforcement learning (RL). If each sub-controller satisfies the preconditions of the next sub-controller, the overall controller will achieve the overall goal. However, even if all sub-controllers are locally optimal in achieving the preconditions of the next, with respect to some performance metric such as completion time, the overall controller might still be far from optimal with respect to the same performance metric. In this paper we show how the performance of the overall controller can be improved if we use approximations of value functions to inform the design of a sub-controller of the needs of the next one. We also show how, under certain assumptions, this leads to a globally optimal controller when the process is executed on all sub-controllers. Finally, this result also holds when some of the sub-controllers are already given, i.e., if we are constrained to use some existing sub-controllers the overall controller will be globally optimal given this constraint.
|
|
16:40-16:45, Paper WeET9.2 | |
Deliberative Control-Aware Motion Planning for Kinematic-Constrained UAVs in a Dynamic Environment |
|
Freitas, Elias José de Rezende | Universidade Federal De Minas Gerais |
Vangasse, Arthur | Universidade Federal De Minas Gerais |
Cohen, Miri Weiss | Braude College of Engineering |
Guimarães, Frederico Gadelha | UFMG |
Pimenta, Luciano | Universidade Federal De Minas Gerais |
Keywords: Constrained Motion Planning, Collision Avoidance, Motion and Path Planning
Abstract: This paper introduces a motion planning approach for navigating in a dynamic environment. The path is represented using a Non-Uniform Rational B-Spline (NURBS) to ensure smoothness, curvature continuity, and proper orientation by adjusting its parameters. A Differential Evolution algorithm optimizes the curve parameters and traversal speed at each re-planning interval, taking into account speed limits, maximum curvature, and obstacles in the environment. A constraint-based on Velocity Obstacle (VO) ensures collision-free motion, considering bounds provided by lower-level controllers. The feasibility of the approach is validated through simulations and real-world experiments with the Crazyflie 2.1 micro quadcopter.
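The Velocity-Obstacle-style constraint can be illustrated with a simple finite-horizon feasibility check: a candidate velocity is rejected if the predicted relative motion brings the robot within a combined safety radius of an obstacle. This is a generic sketch with constant velocities and invented numbers, not the paper's NURBS/Differential-Evolution formulation:

```python
import numpy as np

def in_velocity_obstacle(v_candidate, p_rel, v_obstacle, combined_radius, horizon=5.0):
    """Return True if the candidate velocity leads to a collision within `horizon`.

    p_rel: obstacle position relative to the robot (2,)
    v_obstacle: obstacle velocity (2,)
    combined_radius: robot radius + obstacle radius (+ safety margin)
    """
    v_rel = np.asarray(v_candidate, dtype=float) - np.asarray(v_obstacle, dtype=float)
    p_rel = np.asarray(p_rel, dtype=float)
    speed2 = float(v_rel @ v_rel)
    if speed2 < 1e-12:
        return float(np.linalg.norm(p_rel)) < combined_radius
    # Closest approach of the relative motion p(t) = p_rel - v_rel * t.
    t_star = float(np.clip((p_rel @ v_rel) / speed2, 0.0, horizon))
    closest = float(np.linalg.norm(p_rel - v_rel * t_star))
    return closest < combined_radius

# Toy check: obstacle 4 m ahead drifting slowly sideways; heading straight at
# it collides, a laterally offset velocity does not.
p_rel = np.array([4.0, 0.0])
v_obs = np.array([0.0, 0.05])
print(in_velocity_obstacle([1.0, 0.0], p_rel, v_obs, combined_radius=0.6))   # True
print(in_velocity_obstacle([1.0, 0.8], p_rel, v_obs, combined_radius=0.6))   # False
```

In the optimization, such a check (or an equivalent analytic cone condition) can be imposed as a constraint on the candidate speed profile produced at each re-planning interval.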
|
|
16:45-16:50, Paper WeET9.3 | |
Robot Navigation in Unknown and Cluttered Workspace with Dynamical System Modulation in Starshaped Roadmap |
|
Chen, Kai | The Hong Kong University of Science and Technology |
Liu, Haichao | The Hong Kong University of Science and Technology |
Li, Yulin | Hong Kong University of Science and Technology(HKUST) |
Duan, Jianghua | Hong Kong University of Science and Technology |
Zhu, Lei | The Hong Kong University of Science and Technology (Guangzhou) |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Integrated Planning and Control, Autonomous Vehicle Navigation, Sensor-based Control
Abstract: Compared to conventional decomposition methods that use ellipses or polygons to represent free space, a starshaped representation can better capture the natural distribution of sensor data, thereby exploiting a larger portion of traversable space. This paper introduces a novel motion planning and control framework for navigating robots in unknown and cluttered environments using a dynamically constructed starshaped roadmap. Our approach generates a starshaped representation of the surrounding free space from real-time sensor data using piece-wise polynomials. Additionally, an incremental roadmap maintaining the connectivity information is constructed, and a searching algorithm efficiently selects short-term goals on this roadmap. Importantly, this framework addresses dead-end situations with a graph updating mechanism. To ensure safe and efficient movement within the starshaped roadmap, we propose a reactive controller based on Dynamic System Modulation (DSM). This controller facilitates smooth motion within starshaped regions and their intersections, avoiding conservative and short-sighted behaviors and allowing the system to handle intricate obstacle configurations in unknown and cluttered environments. Comprehensive evaluations in both simulations and real-world experiments show that the proposed method achieves higher success rates and reduced travel times compared to other methods, while effectively managing intricate obstacle configurations.
|
|
16:50-16:55, Paper WeET9.4 | |
Robust Planning for Autonomous Driving Via Mixed Adversarial Diffusion Predictions |
|
Zhao, Albert | University of California Los Angeles |
Soatto, Stefano | UCLA |
Keywords: Planning under Uncertainty, Robot Safety, Autonomous Vehicle Navigation
Abstract: We describe a robust planning method for autonomous driving that mixes normal and adversarial agent predictions output by a diffusion model trained for motion prediction. We first train a diffusion model to learn an unbiased distribution of normal agent behaviors. We then generate a distribution of adversarial predictions by biasing the diffusion model at test time to generate predictions that are likely to collide with a candidate plan. We score plans using expected cost with respect to a mixture distribution of normal and adversarial predictions, leading to a planner that is robust against adversarial behaviors but not overly conservative when agents behave normally. Unlike current approaches, we do not use risk measures that over-weight adversarial behaviors while placing little to no weight on low-cost normal behaviors or use hard safety constraints that may not be appropriate for all driving scenarios. We show the effectiveness of our method on single-agent and multi-agent jaywalking scenarios as well as a red light violation scenario.
|
|
16:55-17:00, Paper WeET9.5 | |
No Plan but Everything under Control: Robustly Solving Sequential Tasks with Dynamically Composed Gradient Descent |
|
Mengers, Vito | Technische Universität Berlin |
Brock, Oliver | Technische Universität Berlin |
Keywords: Integrated Planning and Control, Reactive and Sensor-Based Planning, Optimization and Optimal Control
Abstract: We introduce a novel gradient-based approach for solving sequential tasks by dynamically adjusting the underlying myopic potential field in response to feedback and the world's regularities. This adjustment implicitly considers subgoals encoded in these regularities, enabling the solution of long sequential tasks, as demonstrated by solving the traditional planning domain of Blocks World, without any planning. Unlike conventional planning methods, our feedback-driven approach adapts to uncertain and dynamic environments, as demonstrated by one hundred real-world trials involving drawer manipulation. These experiments highlight the robustness of our method compared to planning and show how interactive perception and error recovery naturally emerge from gradient descent without explicitly implementing them. This offers a computationally efficient alternative to planning for a variety of sequential tasks, while aligning with observations on biological problem-solving strategies.
|
|
17:00-17:05, Paper WeET9.6 | |
Autonomous Navigation in Ice-Covered Waters with Learned Predictions on Ship-Ice Interactions |
|
Zhong, Ninghan | University of Illinois at Urbana-Champaign |
Potenza, Alessandro | University of Manitoba |
Smith, Stephen L. | University of Waterloo |
Keywords: Integrated Planning and Learning, Marine Robotics, Motion and Path Planning
Abstract: Autonomous navigation in ice-covered waters poses significant challenges due to the frequent lack of viable collision-free trajectories. When complete obstacle avoidance is infeasible, it becomes imperative for the navigation strategy to minimize collisions. Additionally, the dynamic nature of ice, which moves in response to ship maneuvers, complicates the path planning process. To address these challenges, we propose a novel deep learning model to estimate the coarse dynamics of ice movements triggered by ship actions through occupancy estimation. To ensure real-time applicability, we propose a novel approach that caches intermediate prediction results and seamlessly integrates the predictive model into a graph search planner. We evaluate the proposed planner in both simulation and in a physical testbed against existing approaches and show that our planner significantly reduces collisions with ice when compared to the state-of-the-art. Codes and demos of this work are available at https://github.com/IvanIZ/predictive-asv-planner.
|
|
17:05-17:10, Paper WeET9.7 | |
IKap: Kinematics-Aware Planning with Imperative Learning |
|
Li, Qihang | University at Buffalo |
Chen, Zhuoqun | University of California San Diego |
Zheng, Haoze | University at Buffalo |
He, Haonan | Department of Mechanical Engineering, College of Engineering, Ca |
Zhan, Zitong | University at Buffalo, SUNY |
Su, Shaoshu | State University of New York at Buffalo |
Geng, Junyi | Pennsylvania State University |
Wang, Chen | University at Buffalo |
Keywords: Integrated Planning and Learning, Collision Avoidance, Motion and Path Planning
Abstract: Trajectory planning in robotics aims to generate collision-free pose sequences that can be reliably executed. Recently, vision-to-planning systems have gained increasing attention for their efficiency and ability to interpret and adapt to surrounding environments. However, traditional modular systems suffer from increased latency and error propagation, while purely data-driven approaches often overlook the robot's kinematic constraints. This oversight leads to discrepancies between planned trajectories and those that are executable. To address these challenges, we propose iKap, a novel vision-to-planning system that integrates the robot's kinematic model directly into the learning pipeline. iKap employs a self-supervised learning approach and incorporates the state transition model within a differentiable bi-level optimization framework. This integration ensures the network learns collision-free waypoints while satisfying kinematic constraints, enabling gradient back-propagation for end-to-end training. Our experimental results demonstrate that iKap achieves higher success rates and reduced latency compared to the state-of-the-art methods. Besides the complete system, iKap offers a visual-to-planning network that seamlessly works with various controllers, providing a robust solution for robots navigating complex environments.
|
|
17:10-17:15, Paper WeET9.8 | |
Differentiable-Optimization Based Neural Policy for Occlusion-Aware Target Tracking |
|
Masnavi, Houman | Toronto Metropolitan University |
Singh, Arun Kumar | University of Tartu |
Janabi-Sharifi, Farrokh | Ryerson University |
Keywords: Aerial Systems: Applications, Motion and Path Planning, Integrated Planning and Learning
Abstract: We propose a learned probabilistic neural policy for safe, occlusion-free target tracking. The core novelty of our work stems from the structure of our policy network that combines generative modeling based on Conditional Variational Autoencoder (CVAE) with differentiable optimization layers. The weights of the CVAE network and the parameters of the differentiable optimization can be learned in an end-to-end fashion through demonstration trajectories. We improve the state-of-the-art (SOTA) in the following respects. We show that our learned policy outperforms existing SOTA in terms of occlusion/collision avoidance capabilities and computation time. Second, we present an extensive ablation showing how different components of our learning pipeline contribute to the overall tracking task. We also demonstrate the real-time performance of our approach on resource-constrained hardware such as NVIDIA Jetson TX2. Finally, our learned policy can also be viewed as a reactive planner for navigation in highly cluttered environments.
|
|
WeET10 |
313 |
Multi-Robot Planning |
Regular Session |
Chair: Li, Jiaoyang | Carnegie Mellon University |
|
16:35-16:40, Paper WeET10.1 | |
Multi-Horizon Multi-Agent Planning Using Decentralised Monte Carlo Tree Search |
|
Seiler, Konstantin M | University of Technology Sydney |
Kong, Felix Honglim | The University of Technology Sydney |
Fitch, Robert | University of Technology Sydney |
Keywords: Multi-Robot Systems, Path Planning for Multiple Mobile Robots or Agents
Abstract: We propose multi-horizon Monte Carlo tree search (MH-MCTS), the first framework for integrated hierarchical multi-horizon, multi-agent planning based on Monte Carlo tree search (MCTS). The method employs multiple simultaneous MCTS optimisations for each planning level within each agent, which are designed to optimise a joint objective function. Using concepts from decentralised Monte Carlo tree search (Dec-MCTS), the individual optimisations continuously exchange information about their current plans. This breaks the common top-down-only information flow within the planning hierarchy and allows higher-level optimisers to consider progress made by lower-level planners. The method is implemented for survey missions using a fleet of ground robots. Simulation results with different mission profiles show substantial performance improvements for the new method, of up to 59%, compared to traditional MCTS and Dec-MCTS.
|
|
16:40-16:45, Paper WeET10.2 | |
Generalized Mission Planning for Heterogeneous Multi-Robot Teams Via LLM-Constructed Hierarchical Trees |
|
Gupta, Piyush | Honda Research Institute, US |
Isele, David | University of Pennsylvania, Honda Research Institute USA |
Sachdeva, Enna | Honda Research Institute |
Huang, Pin-Hao | Honda Research Institute |
Dariush, Behzad | Honda Research Institute USA |
Lee, Kwonjoon | Honda Research Institute USA |
Bae, Sangjae | Honda Research Institute, USA |
Keywords: Multi-Robot Systems, Task Planning, AI-Enabled Robotics
Abstract: We present a novel mission-planning strategy for heterogeneous multi-robot teams, taking into account the specific constraints and capabilities of each robot. Our approach employs hierarchical trees to systematically break down complex missions into manageable sub-tasks. We develop specialized APIs and tools, which are utilized by Large Language Models (LLMs) to efficiently construct these hierarchical trees. Once the hierarchical tree is generated, it is further decomposed to create optimized schedules for each robot, ensuring adherence to their individual constraints and capabilities. We demonstrate the effectiveness of our framework through detailed examples covering a wide range of missions, showcasing its flexibility and scalability.
|
|
16:45-16:50, Paper WeET10.3 | |
Efficient Coordination and Synchronization of Multi-Robot Systems under Recurring Linear Temporal Logic |
|
Peron, Davide | Università Degli Studi Di Padova |
Nan Fernandez-Ayala, Victor | KTH Royal Institute of Technology |
Vlahakis, Eleftherios E. | KTH Royal Institute of Technology |
Dimarogonas, Dimos V. | KTH Royal Institute of Technology |
Keywords: Cooperating Robots, Multi-Robot Systems, Task and Motion Planning
Abstract: We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.
|
|
16:50-16:55, Paper WeET10.4 | |
HULK: Large-Scale Hierarchical Coordination under Continual and Uncertain Temporal Tasks |
|
Luo, Qingyuan | Peking University |
Li, Jie | National University of Defense Technology |
Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Task and Motion Planning, Formal Methods in Robotics and Automation
Abstract: Multi-agent systems can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. Coordination of such teams often involves two aspects: (i) selecting appropriate subteams for different tasks in various areas; (ii) coordinating agents in the subteams to execute the associated subtasks. Existing work often assumes that the tasks are static and known beforehand, where an integer program can be formulated and solved offline. However, in many applications, the team-wise tasks are generated online continually by external requests; and the amount of subtasks within each task is uncertain (e.g., the number of packages to deliver, and victims to rescue). The aforementioned offline solution becomes inadequate as it would require constant re-computation for the whole team and global communication to broadcast the results. Thus, this work tackles the large-scale coordination problem under continual and uncertain temporal tasks, specified as temporal logic formulas over collaborative actions. The proposed hierarchical framework (HULK) consists of two interleaved layers: the rolling assignment of currently-known tasks to sub-teams within a certain horizon, and the dynamic coordination within a sub-team given the detected subtasks during online execution. Thus, the coordination is performed hierarchically at different granularities and triggering conditions, to improve the computational efficiency and robustness. It is validated rigorously over large-scale heterogeneous systems under various temporal tasks and environment uncertainties.
|
|
16:55-17:00, Paper WeET10.5 | |
COHERENT: Collaboration of Heterogeneous Multi-Robot System with Large Language Models |
|
Liu, Kehui | Northwestern Polytechnical University |
Tang, Zixin | National University of Defense Technology |
Wang, Dong | Shanghai Artificial Intelligence Laboratory |
Wang, Zhigang | Shanghai AI Laboratory |
Li, Xuelong | Northwestern Polytechnical University |
Zhao, Bin | Northwestern Polytechnical University |
Keywords: Multi-Robot Systems, Cooperating Robots
Abstract: Leveraging the powerful reasoning capabilities of large language models (LLMs), recent LLM-based robot task planning methods yield promising results. However, they mainly focus on single or multiple homogeneous robots on simple tasks. Practically, complex long-horizon tasks always require collaboration among multiple heterogeneous robots especially with more complex action spaces, which makes these tasks more challenging. To this end, we propose COHERENT, a novel LLM-based task planning framework for collaboration of heterogeneous multi-robot systems including quadrotors, robotic dogs, and robotic arms. Specifically, a Proposal-Execution-Feedback-Adjustment (PEFA) mechanism is designed to decompose and assign actions for individual robots, where a centralized task assigner makes a task planning proposal to decompose the complex task into subtasks, and then assigns subtasks to robot executors. Each robot executor selects a feasible action to implement the assigned subtask and reports self-reflection feedback to the task assigner for plan adjustment. The PEFA loops until the task is completed. Moreover, we create a challenging heterogeneous multi-robot task planning benchmark encompassing 100 complex long-horizon tasks. The experimental results show that our work surpasses the previous methods by a large margin in terms of success rate and execution efficiency. The experimental videos, code, and benchmark are released at https://github.com/MrKeee/COHERENT.
|
|
17:00-17:05, Paper WeET10.6 | |
LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner |
|
Zhang, Xiaopan | University of California - Riverside |
Qin, Hao | Pennsylvania State University |
Wang, Fuquan | University of California Riverside |
Dong, Yue | University of California Riverside |
Li, Jiachen | University of California, Riverside |
Keywords: Multi-Robot Systems, Cooperating Robots, AI-Enabled Robotics
Abstract: Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long-horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework that achieves state-of-the-art performance on long-horizon tasks. LaMMA-P integrates the strengths of the LMs’ reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2-THOR environment. The experimental results demonstrate that LaMMA-P achieves a 105% higher success rate and 36% higher efficiency than existing LM-based multi-agent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma-p.github.io.
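As a rough illustration of the planner-facing glue such a pipeline needs (the domain name, predicates, and objects below are invented for illustration and are not LaMMA-P's actual domain files or prompts), an LM-allocated subtask for a single robot can be serialized into a PDDL problem with a few lines of Python and then handed to an off-the-shelf heuristic planner:

# Hypothetical example: turn an LM-allocated subtask into a PDDL problem string.
PROBLEM_TEMPLATE = """(define (problem {name})
  (:domain household)
  (:objects {objects})
  (:init {init})
  (:goal (and {goals})))"""

def make_problem(name, objects, init_facts, goal_facts):
    # Assemble a PDDL problem file from the facts and goals the LM produced.
    return PROBLEM_TEMPLATE.format(
        name=name,
        objects=" ".join(objects),
        init=" ".join(init_facts),
        goals=" ".join(goal_facts),
    )

print(make_problem(
    "robot1_subtask",
    ["apple - object", "fridge - receptacle", "robot1 - agent"],
    ["(at robot1 kitchen)", "(in apple kitchen)"],
    ["(inside apple fridge)"],
))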
|
|
17:05-17:10, Paper WeET10.7 | |
FlyKites: Human-Centric Interactive Exploration and Assistance under Limited Communication |
|
Zhang, Yuyang | Peking University |
Tian, Zhuoli | Peking University |
Wei, Jinsheng | Peking University |
Guo, Meng | Peking University |
Keywords: Multi-Robot Systems, Task and Motion Planning, Human-Robot Teaming
Abstract: Fleets of autonomous robots have been deployed for exploration of unknown scenes for features of interest, e.g., subterranean exploration, reconnaissance, and search and rescue missions. During exploration, the robots may encounter unidentified targets, blocked passages, interactive objects, temporary failure, or other unexpected events, all of which require consistent human assistance with reliable communication for a period of time. This, however, can be particularly challenging if the communication among the robots is severely restricted to only close-range exchange via ad-hoc networks, especially in extreme environments like caves and underground tunnels. This paper presents a novel human-centric interactive exploration and assistance framework called FlyKites, for multi-robot systems under limited communication. It consists of three interleaved components: (I) the distributed exploration and intermittent communication (called the "spread mode"), where the robots collaboratively explore the environment and exchange local data among the fleet and with the operator; (II) the simultaneous optimization of the relay topology, the operator path, and the assignment of robots to relay roles (called the "relay mode"), such that all requested assistance can be provided with minimum delay; (III) the human-in-the-loop online execution, where the robots switch between different roles and interact with the operator adaptively. Extensive human-in-the-loop simulations and hardware experiments are performed over numerous challenging scenes.
|
|
17:10-17:15, Paper WeET10.8 | |
Work Smarter Not Harder: Simple Imitation Learning with CS-PIBT Outperforms Large-Scale Imitation Learning for MAPF |
|
Veerapaneni, Rishi | Carnegie Mellon University |
Jakobsson, Arthur | Carnegie Mellon University |
Ren, Kevin | Carnegie Mellon University |
Kim, Samuel | Solon High School |
Li, Jiaoyang | Carnegie Mellon University |
Likhachev, Maxim | Carnegie Mellon University |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Motion and Path Planning, Imitation Learning
Abstract: Multi-Agent Path Finding (MAPF) is the problem of effectively finding efficient collision-free paths for a group of agents in a shared workspace. The MAPF community has largely focused on developing high-performance heuristic search methods. Recently, several works have applied various machine learning (ML) techniques to solve MAPF, usually involving sophisticated architectures, reinforcement learning techniques, and set-ups, but none using large amounts of high-quality supervised data. Our initial objective in this work was to show how simple large-scale imitation learning of high-quality heuristic search methods can lead to state-of-the-art ML MAPF performance. However, we find that, at least with our model architecture, simple large-scale (700k examples with hundreds of agents per example) imitation learning does not produce impressive results. Instead, we find that by using prior work that post-processes MAPF model predictions to resolve 1-step collisions (CS-PIBT), we can train a simple ML MAPF policy in minutes that dramatically outperforms existing ML MAPF policies. This has serious implications for all future ML MAPF policies (with local communication) which currently struggle to scale. In particular, this finding implies that future learnt policies should always (1) use smart 1-step collision shields (e.g. CS-PIBT) and (2) include the collision shield with greedy actions as a baseline (e.g. PIBT), as well as (3) motivates future models to focus on longer horizon / more complex planning as 1-step collisions can be efficiently resolved.
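For readers unfamiliar with 1-step collision shields, the following Python sketch (a simplification under assumed grid-world conventions, not the authors' code, and without PIBT's recursive priority inheritance) shows the core idea: each agent proposes policy-ranked next cells, and a single priority-ordered pass greedily picks the best candidate that causes no vertex or edge collision, falling back to waiting.

def collision_shield(current, proposals, priorities):
    """current: dict agent -> (r, c); proposals: dict agent -> candidate next cells
    ranked by the learnt policy (first = preferred); priorities: agents, highest
    priority first. Returns dict agent -> chosen cell for the next timestep."""
    chosen, occupied, vacated = {}, {}, {}
    for agent in priorities:
        placed = False
        for cell in proposals[agent] + [current[agent]]:   # waiting is the last resort
            if cell in occupied:                           # vertex collision
                continue
            other = vacated.get(cell)
            if other is not None and chosen[other] == current[agent]:
                continue                                   # edge (swap) collision
            chosen[agent] = cell
            occupied[cell] = agent
            if cell != current[agent]:
                vacated[current[agent]] = agent
            placed = True
            break
        if not placed:
            # Simplified fallback; full PIBT resolves this case via priority inheritance.
            chosen[agent] = current[agent]
    return chosen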
|
|
WeET11 |
314 |
Agile Legged Locomotion |
Regular Session |
Chair: Clark, Jonathan | Florida State University |
Co-Chair: Duong, Thai | Rice University |
|
16:35-16:40, Paper WeET11.1 | |
Mastering Agile Jumping Skills from Simple Practices with Iterative Learning Control |
|
Nguyen, Chuong | University of Southern California |
Bao, Lingfan | University College London |
Nguyen, Quan | University of Southern California |
Keywords: Legged Robots, Learning from Experience, Model Learning for Control
Abstract: Achieving precise target jumping with legged robots poses a significant challenge due to the long flight phase and the uncertainties inherent in contact dynamics and hardware. Forcefully attempting these agile motions on hardware could result in severe failures and potential damage. Motivated by this challenge, we propose an Iterative Learning Control (ILC) approach to learn and refine jumping skills from easy to difficult, instead of directly learning these challenging tasks. We verify that learning from simplicity can enhance safety and target jumping accuracy over trials. Compared to other ILC approaches for legged locomotion, our method can tackle the problem of a long flight phase where control input is not available. In addition, our approach allows the robot to apply what it learns from a simple jumping task to accomplish more challenging tasks within a few trials directly in hardware, instead of learning from scratch. We validate the method through extensive experiments on the A1 model and hardware for various tasks. Starting from a small jump (e.g., a 40cm forward jump), our learning approach empowers the robot to accomplish a variety of challenging targets, including jumping onto a 20cm high box, leaping to a greater distance of up to 60cm, as well as performing jumps while carrying an unknown payload of 2kg. Our framework allows the robot to reach the desired position and orientation targets with approximate errors of 1cm and 1 degree within a few trials.
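To make the iterative-learning idea concrete, here is a minimal, generic P-type ILC loop on a toy first-order system (illustrative only; the paper's controller, system model, and learning gains differ): the feedforward input is refined trial by trial using the tracking error recorded on the previous trial.

import numpy as np

def run_trial(u, a=0.9, b=0.5):
    # Simulate y[t+1] = a*y[t] + b*u[t] for one trial.
    y = np.zeros(len(u) + 1)
    for t in range(len(u)):
        y[t + 1] = a * y[t] + b * u[t]
    return y

T = 50
ref = np.linspace(0.0, 1.0, T + 1)     # desired trajectory (e.g., body height)
u = np.zeros(T)                        # start from a simple/easy input
L_gain = 0.8                           # learning gain (converges since |1 - L*b| < 1)

for trial in range(30):
    y = run_trial(u)
    e = ref[1:] - y[1:]                # tracking error of this trial
    u = u + L_gain * e                 # ILC update: reuse the error on the next trial

print("final max tracking error:", np.abs(ref[1:] - run_trial(u)[1:]).max())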
|
|
16:40-16:45, Paper WeET11.2 | |
Agile Continuous Jumping in Discontinuous Terrains |
|
Yang, Yuxiang | Google Deepmind |
Shi, Guanya | Carnegie Mellon University |
Lin, Changyi | Carnegie Mellon University |
Meng, Xiangyun | University of Washington |
Scalise, Rosario | University of Washington |
Guaman Castro, Mateo | University of Washington |
Yu, Wenhao | Google |
Zhang, Tingnan | Google |
Zhao, Ding | Carnegie Mellon University |
Tan, Jie | Google |
Boots, Byron | University of Washington |
Keywords: Machine Learning for Robot Control, Reinforcement Learning, Legged Robots
Abstract: We focus on advancing the agility of quadrupedal robots with continuous, precise, and terrain-adaptive jumping in discontinuous terrains such as stairs and stepping stones. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Such a hierarchical and hybrid framework effectively combines the advantages of model-free learning and model-based control, therefore enabling a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and complete a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities.
|
|
16:45-16:50, Paper WeET11.3 | |
High Accuracy Aerial Maneuvers on Legged Robots Using Variational Integrator Discretized Trajectory Optimization |
|
Beck, Scott | University of Southern California |
Nguyen, Chuong | University of Southern California |
Duong, Thai | Rice University |
Atanasov, Nikolay | University of California, San Diego |
Nguyen, Quan | University of Southern California |
Keywords: Legged Robots, Optimization and Optimal Control
Abstract: Performing acrobatic maneuvers involving long aerial phases, such as precise dives or multiple backflips from significant heights, remains an open challenge in legged robot autonomy. Such aggressive motions often require accurate state predictions over long horizons with multiple contacts and extended flight phases. Most existing trajectory optimization (TO) methods rely on Euler or Runge-Kutta integration, which can accumulate significant prediction errors over long planning horizons. In this work, we propose a novel whole-body TO method using variational integration (VI) and full-body nonlinear dynamics for long-flight aggressive maneuvers. Compared to traditional Euler-based TO, our approach using VI preserves energy and momentum properties of the continuous-time system and reduces error between predicted and executed trajectories by factors of 2 to 10 while achieving similar planning time. We successfully demonstrate long-flight triple backflips on a quadruped A1 robot model and backflips on a bipedal HECTOR robot model for various heights and distances, achieving landing angle errors of only a few degrees. In contrast, TO with Euler integration fails to achieve accurate landings in equivalent circumstances, e.g., with landing angle errors greater than 90° for triple backflips. We provide an open-source implementation of our VI-discretized TO to support further research on accurate dynamic maneuvers for multi-rigid-body robot systems with contact: https://github.com/DRCL-USC/VI_discretized_TO
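The following toy sketch shows what a variational-integrator discretization looks like for a single pendulum with a midpoint discrete Lagrangian (an illustration of the discretization idea only, under assumed parameters; it is not the paper's whole-body trajectory optimization): each step solves the discrete Euler-Lagrange equation for the next configuration.

import numpy as np
from scipy.optimize import fsolve

m, l, g, h = 1.0, 1.0, 9.81, 0.02      # pendulum mass, length, gravity, timestep

def Ld_grad(qa, qb):
    """Gradients (D1, D2) of the discrete Lagrangian
       L_d(qa, qb) = h * [0.5*m*l^2*((qb-qa)/h)^2 + m*g*l*cos((qa+qb)/2)]."""
    v = (qb - qa) / h
    qm = 0.5 * (qa + qb)
    D1 = -m * l**2 * v - 0.5 * h * m * g * l * np.sin(qm)
    D2 =  m * l**2 * v - 0.5 * h * m * g * l * np.sin(qm)
    return D1, D2

# Discrete Euler-Lagrange equation: D2 L_d(q_{k-1}, q_k) + D1 L_d(q_k, q_{k+1}) = 0
q = [0.5, 0.5]                          # two initial configurations (zero initial velocity)
for k in range(1, 500):
    res = lambda qn: Ld_grad(q[k - 1], q[k])[1] + Ld_grad(q[k], qn)[0]
    q.append(float(fsolve(res, 2 * q[k] - q[k - 1])[0]))

print("pendulum angle after ~10 s:", q[-1])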
|
|
16:50-16:55, Paper WeET11.4 | |
Learn to Swim: Data-Driven LSTM Hydrodynamic Model for Quadruped Robot Gait Optimization |
|
Han, Fei | Westlake University |
Guo, Pengming | Westlake University |
Chen, Hao | Westlake University |
Li, Weikun | Westlake University |
Ren, Jingbo | Xinyang Normal University |
Liu, Naijun | Institute of Automation Chinese Academy of Sciences |
Yang, Ning | Institute of Automation, Chinese Academy of Sciences |
Fan, Dixia | Westlake University |
Keywords: Legged Robots, Model Learning for Control, Whole-Body Motion Planning and Control
Abstract: This paper presents a Long Short-Term Memory network-based Fluid Experiment Data-Driven model (FED-LSTM) for predicting unsteady, nonlinear hydrodynamic forces on the underwater quadruped robot we constructed. Trained on experimental data from leg force and body drag tests conducted in both a recirculating water tank and a towing tank, FED-LSTM outperforms traditional empirical formulas (EF) commonly used for flow prediction over flat surfaces. The model demonstrates superior accuracy and adaptability in capturing complex fluid dynamics, particularly in straight-line and turning-gait optimizations via the NSGA-II algorithm. FED-LSTM reduces deflection errors during straight-line swimming and improves turn times without increasing the turning radius. Hardware experiments further validate the model's precision and stability over EF. This approach provides a robust framework for enhancing the swimming performance of legged robots, laying the groundwork for future advances in underwater robotic locomotion.
|
|
16:55-17:00, Paper WeET11.5 | |
Stage-Wise Reward Shaping for Acrobatic Robots: A Constrained Multi-Objective Reinforcement Learning Approach |
|
Kim, Dohyeong | Seoul National University |
Kwon, Hyeokjin | Seoul National University |
Kim, Junseok | Seoul National University |
Lee, Gunmin | Seoul National University |
Oh, Songhwai | Seoul National University |
Keywords: Reinforcement Learning, Legged Robots, Robot Safety
Abstract: As the complexity of tasks addressed through reinforcement learning (RL) increases, the definition of reward functions has also become highly complicated. We introduce an RL method aimed at simplifying the reward-shaping process through intuitive strategies. Initially, instead of a single reward function composed of various terms, we define multiple reward and cost functions within a constrained multi-objective RL (CMORL) framework. For tasks involving sequential complex movements, we segment the task into distinct stages and define multiple rewards and costs for each stage. Finally, we introduce a practical CMORL algorithm that maximizes objectives based on these rewards while satisfying constraints defined by the costs. The proposed method has been successfully demonstrated across a variety of acrobatic tasks in both simulation and real-world environments. Additionally, it has been shown to perform these tasks successfully in comparison with existing RL and constrained RL algorithms. Our code is available at https://github.com/rllab-snu/Stage-Wise-CMORL.
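To illustrate the stage-wise reward/cost structure described above (the stage names, stage detector, reward terms, and thresholds below are assumptions for a backflip-like task, not the released implementation), one might organize the shaping as follows; a CMORL algorithm would then maximize the active stage's reward subject to expected-cost constraints.

import numpy as np

STAGES = ["crouch", "takeoff", "flight", "landing"]

def stage_of(state):
    # Hypothetical stage detector based on contact and base height/velocity.
    if state["contact"] and state["base_height"] < 0.25:
        return "crouch" if state["base_vel_z"] <= 0 else "takeoff"
    return "flight" if not state["contact"] else "landing"

def rewards_and_costs(state, action):
    s = stage_of(state)
    rewards = {
        "crouch":  -abs(state["base_height"] - 0.20),      # settle into a crouch
        "takeoff":  state["base_vel_z"],                    # push upward
        "flight":   state["pitch_rate"],                    # encourage rotation mid-air
        "landing": -np.linalg.norm(state["base_vel"]),      # come to rest
    }
    costs = {                                               # constraints, not rewards
        "joint_limit":  float(np.any(np.abs(action) > 1.0)),
        "body_contact": float(state["body_contact"]),
    }
    return s, rewards[s], costs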
|
|
17:00-17:05, Paper WeET11.6 | |
Design and Implementation of a Swimming and Walking Quadruped for Seafloor Exploration |
|
Chase, Ashley | Florida State University |
Labiner, Benjamin | North Carolina State University |
Boylan, Jonathan | FAMU-FSU College of Engineering |
Ryals, Cameron | Florida State University |
Vranicar, Jack | Florida State University |
Dina, Michael | Florida State University |
Vasquez, Derek A. | Florida State University |
Seal, Dane | Florida State University |
Young, Charles | Florida State University |
St Laurent, Louis | University of Washington |
Ordonez, Camilo | Florida State University |
Clark, Jonathan | Florida State University |
Keywords: Legged Robots, Biologically-Inspired Robots, Marine Robotics
Abstract: The seafloor is a complex environment and it is challenging to conduct detailed mapping, soil composition sampling, and habitat characterization missions in this benthic region. As a step toward overcoming these challenges, we present a quadruped robot capable of walking on the seafloor and maneuvering via midfluid swimming. SELQIE, the Seafloor Environment Legged Quadruped Intelligent Explorer, is capable of walking underwater at speeds up to 0.2 m/s, swimming at over 0.16 m/s, and transitioning between modes. We also introduce a path planning algorithm that can account for both swimming and walking gaits to efficiently navigate around or over obstacles, and demonstrate the robot executing such a multi-modal trajectory.
|
|
17:05-17:10, Paper WeET11.7 | |
Beyond Robustness: Learning Unknown Dynamic Load Adaptation for Quadruped Locomotion on Rough Terrain |
|
Chang, Leixin | Zhejiang University |
Nai, Yuxuan | Zhejiang University |
Chen, Hua | Zhejiang University |
Yang, Liangjing | Zhejiang University |
Keywords: Reinforcement Learning, Legged Robots
Abstract: Unknown dynamic load carrying is an important practical application for quadruped robots. Such a problem is non-trivial, posing three major challenges in quadruped locomotion control. First, how to model or represent the dynamics of the load in a generic manner. Second, how to make the robot capture the dynamics without any external sensing. Third, how to enable the robot to interact with the load, handling the mutual effect and stabilizing the load. In this work, we propose a general load modeling approach called load characteristics modeling to capture the dynamics of the load. We integrate this proposed modeling technique and leverage recent advances in Reinforcement Learning (RL) based locomotion control to enable the robot to infer the dynamics of load movement and interact with the load indirectly to stabilize it, and we realize sim-to-real deployment to verify its effectiveness in real scenarios. We conduct extensive comparative simulation experiments to validate the effectiveness and superiority of our proposed method. Results show that our method outperforms other methods in sudden load resistance, load stabilization, and locomotion with heavy loads on rough terrain.
|
|
17:10-17:15, Paper WeET11.8 | |
PIE: Parkour with Implicit-Explicit Learning Framework for Legged Robots |
|
Luo, Shixin | Zhejiang University |
Li, Songbo | Zhejiang University |
Yu, Ruiqi | Zhejiang University |
Wang, Zhicheng | Zhejiang University |
Wu, Jun | Zhejiang University |
Zhu, Qiuguo | Zhejiang University |
Keywords: Legged Robots, Reinforcement Learning, Deep Learning for Visual Perception
Abstract: Parkour presents a highly challenging task for legged robots, requiring them to traverse various terrains with agile and smooth locomotion. This necessitates comprehensive understanding of both the robot's own state and the surrounding terrain, despite the inherent unreliability of robot perception and actuation. Current state-of-the-art methods either rely on complex pre-trained high-level terrain reconstruction modules or limit the maximum potential of robot parkour to avoid failure due to inaccurate perception. In this paper, we propose a one-stage end-to-end learning-based parkour framework: Parkour with Implicit-Explicit learning framework for legged robots (PIE) that leverages dual-level implicit-explicit estimation. With this mechanism, even a low-cost quadruped robot equipped with an unreliable egocentric depth camera can achieve exceptional performance on challenging parkour terrains using a relatively simple training process and reward function. While the training process is conducted entirely in simulation, our real-world validation demonstrates successful zero-shot deployment of our framework, showcasing superior parkour performance on harsh terrains.
|
|
WeET12 |
315 |
Visual Servoing and Tracking |
Regular Session |
Chair: Chaumette, Francois | Inria Center at University of Rennes |
Co-Chair: Cheng, Sheng | University of Illinois Urbana-Champaign |
|
16:35-16:40, Paper WeET12.1 | |
Determination of All Stable and Unstable Equilibria for Image-Point-Based Visual Servoing |
|
Colotti, Alessandro | Centre Inria De l'Université De Rennes |
García Fontán, Jorge | Sorbonne Université |
Goldsztejn, Alexandre | CNRS IRCCyN |
Briot, Sébastien | LS2N |
Chaumette, Francois | Inria Center at University of Rennes |
Kermorgant, Olivier | École Centrale Nantes, Laboratoire Des Sciences Du Numérique De |
Safey El Din, Mohab | Sorbonne Univ |
Keywords: Visual Servoing, Formal Methods in Robotics and Automation, Sensor-based Control, Stability Analysis
Abstract: Local minima are a well-known drawback of image-based visual servoing systems. Up to now, there have been no formal guarantees on their number, or even their existence, depending on the considered configuration. In this work, a formal approach is presented for the exhaustive computation of all minima and unstable equilibria for a class of six well-known image-based visual servoing controllers. This approach relies on a new polynomial formulation of the equilibrium condition that avoids using the camera pose. By using modern computational algebraic geometry methods and an ad hoc symmetry-breaking strategy, the formal resolution of this new equilibrium condition is rendered computationally feasible. The proposed methodology is applied to compute the equilibria of several classical visual servoing tasks, with planar and non-planar configurations of four and five points. The effects of local minima and saddle points on the dynamics of the system are finally illustrated through intensive simulation results, as well as the effects of image noise and uncertainties on depths.
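For context, the class of controllers whose equilibria are analysed follows the standard image-point IBVS law (textbook formulation in the style of the Chaumette-Hutchinson tutorials; this is not the paper's algebraic-geometry machinery): the camera twist is v = -lambda * pinv(L) * (s - s*), and spurious equilibria are configurations where the stacked error is non-zero yet annihilated by pinv(L).

import numpy as np

def interaction_matrix(x, y, Z):
    """2x6 interaction matrix of a normalized image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0,  x / Z, x * y,       -(1 + x * x),  y],
        [0.0, -1.0 / Z,  y / Z, 1 + y * y,   -x * y,       -x],
    ])

def ibvs_velocity(points, desired, depths, lam=0.5):
    """Camera twist (vx, vy, vz, wx, wy, wz) for a set of image-point features."""
    L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(points, depths)])
    e = (np.asarray(points) - np.asarray(desired)).reshape(-1)
    return -lam * np.linalg.pinv(L) @ e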
|
|
16:40-16:45, Paper WeET12.2 | |
DiffTune: Auto-Tuning through Auto-Differentiation |
|
Cheng, Sheng | University of Illinois Urbana-Champaign |
Kim, Minkyung | University of Illinois Urbana-Champaign |
Song, Lin | UIUC |
Yang, Chengyu | University of Illinois Urbana-Champaign |
Jin, Yiquan | Zhejiang University |
Wang, Shenlong | University of Illinois at Urbana-Champaign |
Hovakimyan, Naira | University of Illinois at Urbana-Champaign |
Keywords: Control Architectures and Programming, Learning and Adaptive Systems, Aerial Systems: Mechanics and Control, auto-tuning
Abstract: The performance of robots in high-level tasks depends on the quality of their lower-level controller, which requires fine-tuning. However, the intrinsically nonlinear dynamics and controllers make tuning a challenging task when it is done by hand. We present DiffTune, a novel, gradient-based automatic tuning framework. We formulate the controller tuning as a parameter optimization problem and update the controller parameters through gradient-based optimization. The gradient is obtained using sensitivity propagation, which is the only method for gradient computation when tuning for a physical system instead of its simulated counterpart. Furthermore, we use L1 adaptive control to compensate for the uncertainties so that the gradient is not biased by the unmodelled uncertainties. We validate DiffTune in simulation and compare it with state-of-the-art auto-tuning methods, where DiffTune achieves the best performance in a more efficient manner. Experiments on auto-tuning a nonlinear controller for a quadrotor show promising results, where DiffTune achieves a 3.5x tracking error reduction on an aggressive trajectory in only 10 trials over a 12-dimensional controller parameter space.
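A minimal sketch of tuning via sensitivity propagation (illustrative of the mechanism only: a 1D double integrator with a PD controller and plain projected gradient descent, not DiffTune itself and without its L1-adaptive de-biasing): the state sensitivity dx/dtheta is propagated alongside the rollout, so the tracking-loss gradient with respect to the gains is available without back-propagating through a simulator.

import numpy as np

dt, T = 0.02, 200
ref = np.ones(T)                           # step reference for the position

def rollout_with_sensitivity(theta):
    kp, kd = theta
    A = np.array([[1.0, dt], [0.0, 1.0]])  # double-integrator dynamics
    B = np.array([0.0, dt])
    x = np.zeros(2)                        # state: [position, velocity]
    dx_dth = np.zeros((2, 2))              # sensitivity of the state w.r.t. (kp, kd)
    loss, grad = 0.0, np.zeros(2)
    for k in range(T):
        e, edot = ref[k] - x[0], -x[1]
        u = kp * e + kd * edot
        du_dx = np.array([-kp, -kd])       # partials of u w.r.t. the state
        du_dth = np.array([e, edot])       # partials of u w.r.t. the gains
        loss += (x[0] - ref[k]) ** 2       # accumulate tracking loss and its gradient
        grad += 2.0 * (x[0] - ref[k]) * dx_dth[0]
        # propagate the sensitivity, then the state (chain rule through the rollout)
        dx_dth = A @ dx_dth + np.outer(B, du_dx @ dx_dth + du_dth)
        x = A @ x + B * u
    return loss, grad

theta = np.array([1.0, 0.1])               # initial (kp, kd)
for _ in range(100):
    loss, grad = rollout_with_sensitivity(theta)
    # normalized descent step, projected to keep the gains positive
    theta = np.maximum(theta - 0.05 * grad / (np.linalg.norm(grad) + 1e-9), 0.01)
print("tuned gains (kp, kd):", theta, "final loss:", loss)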
|
|
16:45-16:50, Paper WeET12.3 | |
Output Feedback with Feedforward Robust Control for Motion Systems Driven by Nonlinear Position-Dependent Actuators (I) |
|
Al Saaideh, Mohammad | Memorial University of Newfoundland |
Boker, Almuatazbellah | Virginia Tech |
Al Janaideh, Mohammad | University of Guelph |
Keywords: Actuation and Joint Mechanisms, Motion Control
Abstract: This paper introduces a control approach for a motion system driven by a class of actuators with multiple nonlinearities. The proposed approach presents a combination of a feedforward controller and an output feedback controller to enhance the tracking performance of the motion system. The feedforward controller is mainly proposed to address the actuator dynamics and provide a linearization of the actuator without requiring measurements from the actuator. Subsequently, the output feedback controller is designed using the measured position to achieve a tracking objective for a desired reference signal, considering the unknown nonlinearities in the system and the error due to the open-loop compensation using feedforward control. The efficacy of the proposed control approach is validated through three applications: reluctance actuator, electrostatic microactuator, and magnetic levitation system. Both simulation and experimental results demonstrate the effectiveness of the proposed control approach in achieving the desired reference signal with minimal tracking error, considering that the actuator and system nonlinearities are unknown.
|
|
16:50-16:55, Paper WeET12.4 | |
QP-Based Visual Servoing under Motion Blur-Free Constraint |
|
Robic, Maxime | University of Picardy Jules Verne |
Fraisse, Renaud | Airbus Defence & Space |
Marchand, Eric | Univ Rennes, Inria, CNRS, IRISA |
Chaumette, Francois | Inria Center at University of Rennes |
Keywords: Visual Servoing, Space Robotics and Automation, Visual Tracking
Abstract: This work proposes a QP-based visual servoing scheme for limiting motion blur during the achievement of a visual task. Unlike traditional image restoration approaches, we want to avoid any deconvolution step by keeping the image sequence acquired by the camera as sharp as possible. To do so, we select the norm of the image gradient as sharpness metric, from which we design a velocity constraint that is injected in a QP controller. Our system is evaluated for an Earth observation satellite. Simulation and experimental results show the effectiveness of our approach.
|
|
16:55-17:00, Paper WeET12.5 | |
FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality |
|
Ding, Junyuan | Beihang University |
Wang, Ziteng | DVSense (Beijing) Technology Co., Ltd., China |
Gao, Chang | Delft University of Technology |
Liu, Min | DVSense |
Chen, Qinyu | Leiden University |
Keywords: Deep Learning for Visual Perception, Gesture, Posture and Facial Expressions, Sensor-based Control
Abstract: Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. We enhance the EV-Eye dataset by expanding annotated data and converting original mask labels to ellipse-based annotations to train the model. In addition, a novel trigonometric loss is adopted to address angle discontinuities, and a fast causal event volume event representation method is put forward. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6x and 1.8x compared to the prior art, EV-Eye, with 4.4x and 11.7x fewer parameters and arithmetic operations. The code is available at https://github.com/DeanJY/FACET.
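A common way to obtain a smooth, wrap-around-safe angle loss (shown here for context; the exact trigonometric loss used in FACET may differ) is to penalise 1 - cos of the angular difference, so that predictions near +pi and -pi are treated as nearly identical.

import torch

def angular_loss(pred_angle, gt_angle):
    """Both tensors in radians; smooth, periodic, minimal when pred == gt (mod 2*pi)."""
    return (1.0 - torch.cos(pred_angle - gt_angle)).mean()

pred = torch.tensor([3.13, -3.13])       # close to +pi and -pi
gt   = torch.tensor([-3.13, 3.13])
print(angular_loss(pred, gt))            # small, unlike a plain L2 on the raw angles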
|
|
17:00-17:05, Paper WeET12.6 | |
EMoE-Tracker: Environmental MoE-Based Transformer for Robust Event-Guided Object Tracking |
|
Chen, Yucheng | Hong Kong University of Science and Technology (GZ) |
Wang, Lin | Nanyang Technological University (NTU) |
Keywords: Visual Tracking, Sensor Fusion, Deep Learning for Visual Perception
Abstract: The unique complementarity of frame-based and event cameras for high frame rate object tracking has recently inspired some research attempts to develop multi-modal fusion approaches. However, these methods directly fuse both modalities and thus ignore the environmental attributes, e.g., motion blur, illumination variance, occlusion, scale variation, etc. Meanwhile, no interaction between search and template features makes distinguishing target objects and backgrounds difficult. As a result, performance degradation is induced, especially in challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target information and background. To achieve the goal, we first propose an environmental Mixture-of-Experts (eMoE) module that is built upon the environmental Attributes Disentanglement to learn attribute-specific features and environmental Attributes Gating to assemble the attribute-specific features by the learnable attribute scores dynamically. The eMoE module is a subtle router that fine-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to improve interaction and discriminability between the target information and background. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to prior art.
|
|
WeET13 |
316 |
Manipulating Challenging Objects |
Regular Session |
Chair: Khan, Shiraz | University of Delaware |
|
16:35-16:40, Paper WeET13.1 | |
Learning Keypoints for Robotic Cloth Manipulation Using Synthetic Data |
|
Lips, Thomas | Ghent University |
De Gusseme, Victor-Louis | Ghent University |
Wyffels, Francis | Ghent University |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, Simulation and Animation
Abstract: Assistive robots should be able to wash, fold or iron clothes. However, due to the variety, deformability and self-occlusions of clothes, creating robot systems for cloth manipulation is challenging. Synthetic data is a promising direction to improve generalization, but the sim-to-real gap limits its effectiveness. To advance the use of synthetic data for cloth manipulation tasks such as robotic folding, we present a synthetic data pipeline to train keypoint detectors for almost-flattened cloth items. To evaluate its performance, we have also collected a real-world dataset. We train detectors for T-shirts, towels, and shorts and obtain an average precision of 64% and an average keypoint distance of 18 pixels. Fine-tuning on real-world data improves performance to 74% mAP and an average distance of only 9 pixels. Furthermore, we describe failure modes of the keypoint detectors and compare different approaches to obtain cloth meshes and materials. We also quantify the remaining sim-to-real gap and argue that further improvements to the fidelity of cloth assets will be required to further reduce this gap. The code, dataset and trained models are available online.
|
|
16:40-16:45, Paper WeET13.2 | |
RaggeDi: Diffusion-Based State Estimation of Disordered Rags, Sheets, Towels and Blankets |
|
Ye, Jikai | National University of Singapore |
Li, Wanze | National University of Singapore |
Khan, Shiraz | University of Delaware |
Chirikjian, Gregory | University of Delaware |
Keywords: Deep Learning for Visual Perception, RGB-D Perception, Visual Tracking
Abstract: Cloth state estimation is an important problem in robotics. It is essential for the robot to know the accurate state to manipulate cloth and execute tasks such as robotic dressing, stitching, and covering/uncovering human beings. However, estimating cloth state accurately remains challenging due to its high flexibility and self-occlusion. This paper proposes a diffusion model-based pipeline that formulates the cloth state estimation as an image generation problem by representing the cloth state as an RGB image that describes the point-wise translation (translation map) between a pre-defined flattened mesh and the deformed mesh in a canonical space. Then we train a conditional diffusion-based image generation model to predict the translation map based on an observation. Experiments are conducted in both simulation and the real world to validate the performance of our method. Results indicate that our method outperforms two recent methods in both accuracy and speed.
|
|
16:45-16:50, Paper WeET13.3 | |
Excavating in the Wild: The GOOSE-Ex Dataset for Semantic Segmentation |
|
Hagmanns, Raphael | Karlsruhe Institute of Technology |
Mortimer, Peter | Universität Der Bundeswehr München |
Granero, Miguel | Fraunhofer IOSB |
Luettel, Thorsten | Universität Der Bundeswehr München |
Petereit, Janko | Fraunhofer IOSB |
Keywords: Data Sets for Robotic Vision, Field Robots, Deep Learning for Visual Perception
Abstract: The successful deployment of deep learning-based techniques for autonomous systems is highly dependent on the data availability for the respective system in its deployment environment. Especially for unstructured outdoor environments, very few datasets exist for even fewer robotic platforms and scenarios. In an earlier work, we presented the German Outdoor and Offroad Dataset (GOOSE) framework along with 10000 multimodal frames from an offroad vehicle to enhance the perception capabilities in unstructured environments. In this work, we address the generalizability of the GOOSE framework. To accomplish this, we open-source the GOOSE-Ex dataset, which contains additional 5000 labeled multimodal frames from various completely different environments, recorded on a robotic excavator and a quadruped platform. We perform a comprehensive analysis of the semantic segmentation performance on different platforms and sensor modalities in unseen environments. In addition, we demonstrate how the combined datasets can be utilized for different downstream applications or competitions such as offroad navigation, object manipulation or scene completion. The dataset, its platform documentation and pre-trained state-of-the-art models for offroad perception will be made available on https://goose-dataset.de/.
|
|
16:50-16:55, Paper WeET13.4 | |
Robotic Framework for Iterative and Adaptive Profile Grading of Sand |
|
Hanut, Louis | KU Leuven |
Du, Yurui | KU Leuven |
Vande Moere, Andrew | KU Leuven |
Detry, Renaud | KU Leuven |
Bruyninckx, Herman | KU Leuven |
Keywords: Robotics and Automation in Construction, Robust/Adaptive Control
Abstract: This paper studies sand profile grading, a manipulation task to obtain a desired geometric curve in sand. Manipulating sand is challenging because like other amorphous materials, its properties are difficult to estimate and emergent effects such as collapses may occur which both influence the manipulation outcome. To tackle these challenges, humans iterate and adapt their manual actions to the observed material states. In this paper, we propose to replicate this adaptive and iterative approach on a robotic profile grading task. Our results demonstrate that (1) tool insertion adaptation reduces force limit violations during tool-material interactions, (2) grading angle adaptation ensures no undercutting or collisions while allowing for cutting or smoothing the sand profile, and (3) adapting progress speed to task evolution provides a balance between grading precision and execution time. This paper’s findings pave the way for generalized and transferable robotic systems manipulating various amorphous materials and automating a larger set of construction tasks and beyond.
|
|
16:55-17:00, Paper WeET13.5 | |
Autonomous Excavation of Challenging Terrain Using Oscillatory Primitives and Adaptive Impedance Control |
|
Franceschini, Noah | University of Illinois Urbana-Champaign |
Thangeda, Pranay | University of Illinois Urbana-Champaign |
Ornik, Melkior | University of Illinois Urbana-Champaign |
Hauser, Kris | University of Illinois at Urbana-Champaign |
Keywords: Robotics and Automation in Construction, Compliance and Impedance Control, Mining Robotics
Abstract: This paper addresses the challenge of autonomous excavation of challenging terrains, in particular those that are prone to jamming and inter-particle adhesion when tackled by a standard penetrate-drag-scoop motion pattern. Inspired by human excavation strategies, our approach incorporates oscillatory rotation elements -- including swivel, twist, and dive motions -- to break up compacted, tangled grains and reduce jamming. We also present an adaptive impedance control method, the Reactive Attractor Impedance Controller (RAIC), that adapts a motion trajectory to unexpected forces during loading in a manner that tracks a trajectory closely when loads are low, but avoids excessive loads when significant resistance is met. Our method is evaluated on four terrains using a robotic arm, demonstrating improved excavation performance across multiple metrics, including volume scooped, protective stop rate, and trajectory completion percentage.
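The following sketch conveys the reactive-attractor idea described in the abstract (the gains, thresholds, and retreat rule are assumptions, not the RAIC implementation): the impedance attractor advances along the nominal trajectory while measured loads are low, holds when loads are moderate, and backs off along the measured force when resistance becomes excessive.

import numpy as np

K, D = 800.0, 60.0                 # Cartesian stiffness and damping (assumed values)
F_SOFT, F_HARD = 40.0, 80.0        # load thresholds in newtons (assumed values)

def raic_step(x, v, x_attr, traj_next, f_meas, dt=0.002):
    """One control step: update the attractor, then return the commanded wrench."""
    f_norm = np.linalg.norm(f_meas)
    if f_norm < F_SOFT:                        # low load: follow the nominal trajectory
        x_attr = traj_next
    elif f_norm < F_HARD:                      # moderate load: hold the attractor in place
        pass
    else:                                      # excessive load: retreat along the force
        x_attr = x_attr + 0.001 * f_meas / (f_norm + 1e-9)
    wrench = K * (x_attr - x) + D * (0.0 - v)  # impedance command toward the attractor
    return wrench, x_attr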
|
|
17:00-17:05, Paper WeET13.6 | |
Diffusion-Based Self-Supervised Imitation Learning from Imperfect Visual Servoing Demonstrations for Robotic Glass Installation |
|
Xiao, Canran | Central South University |
Hou, Liwei | Central South University |
Fu, Ling | Zoomlion |
Chen, Wenrui | Hunan University |
Keywords: Robotics and Automation in Construction, AI-Based Methods, Imitation Learning
Abstract: Heavy-duty glass installation is a high-risk, precision-critical task in modern construction, traditionally performed through labor-intensive and error-prone manual methods. This paper presents a novel robotic framework that leverages diffusion-based self-supervised imitation learning from imperfect visual servoing demonstrations to achieve safe and precise glass installation. Specifically, our approach employs noisy and suboptimal demonstration data obtained via visual servoing to train a Denoising Diffusion Probabilistic Model (DDPM). This model iteratively refines installation trajectories, transforming them into smooth, precise, and collision-free movements. Extensive experiments demonstrate that our method significantly surpasses conventional visual servoing and standard imitation learning baselines in terms of success rate, precision, and installation efficiency, while markedly improving operational safety. Our results establish a new benchmark for automating complex, high-risk tasks in construction robotics.
|
|
17:05-17:10, Paper WeET13.7 | |
A Global-Local Graph Attention Network for Deformable Linear Objects Dynamic Interaction with Environment |
|
Chu, Jian | Hefei University of Technology |
Zhang, Wenkang | Anhui Agricultural University |
Ouyang, Bo | Hefei University of Technology |
Tian, Kunmiao | Hefei University of Technology |
Zhang, Shuai | Hefei University of Technology |
Zhai, Kai | Hefei University of Technology |
Keywords: Dynamics, Collision Avoidance, Deep Learning Methods
Abstract: Accurately modeling the interactions between deformable linear objects (DLOs) and their environments is crucial for active deformation control by robot manipulators. Graph Neural Networks (GNNs) have shown immense potential in particle-based simulation of DLOs. However, most existing studies propagate particle information in sequence, ignoring that particle motions, including that of the distal particle, correlate strongly with each other and the interaction state. In this paper, a global and local attention dynamic simulation model named GladSim is designed based on GNNs and the attention mechanism to aggregate information among particles and focus on the interaction particles for DLO interaction with the environment. Specifically, a global virtual node is proposed to deliver particle information and shorten the propagation path for the first time, which connects all the particles and aggregates global information. When the DLOs and the obstacle boundary particles are close, an edge is established between them to capture the interaction state. Moreover, we group all the particles by k-hop neighbors and design a HopSA module that combines hop attention and self-attention to discover the correlations among adjacent particles. Experimental results on simulation and real-world data show that the proposed GladSim network significantly outperforms baseline models in predictive accuracy, especially in long-term prediction.
|
|
WeET14 |
402 |
Social Navigation 2 |
Regular Session |
Chair: Kosecka, Jana | George Mason University |
|
16:35-16:40, Paper WeET14.1 | |
Generating Causal Explanations of Vehicular Agent Behavioural Interactions with Learnt Reward Profiles |
|
Howard, Rhys Peter Matthew | University of Oxford |
Hawes, Nick | University of Oxford |
Kunze, Lars | University of Oxford |
Keywords: Intelligent Transportation Systems, AI-Enabled Robotics, Agent-Based Systems
Abstract: Transparency and explainability are important features that responsible autonomous vehicles should possess, particularly when interacting with humans, and causal reasoning offers a strong basis to provide these qualities. However, even if one assumes agents act to maximise some concept of reward, it is difficult to make accurate causal inferences of agent planning without capturing what is of importance to the agent. Thus our work aims to learn a weighting of reward metrics for agents such that explanations for agent interactions can be causally inferred. We validate our approach quantitatively and qualitatively across three real-world driving datasets, demonstrating a functional improvement over previous methods and competitive performance across evaluation metrics.
|
|
16:40-16:45, Paper WeET14.2 | |
Fast Online Learning of CLiFF-Maps in Changing Environments |
|
Zhu, Yufei | Örebro University |
Rudenko, Andrey | Robert Bosch GmbH |
Palmieri, Luigi | Robert Bosch GmbH |
Heuer, Lukas | Örebro University, Robert Bosch GmbH |
Lilienthal, Achim J. | Orebro University |
Magnusson, Martin | Örebro University |
Keywords: Human Detection and Tracking
Abstract: Maps of dynamics are effective representations of motion patterns learned from prior observations, with recent research demonstrating their ability to enhance various downstream tasks such as human-aware robot navigation, long-term human motion prediction, and robot localization. Current advancements have primarily concentrated on methods for learning maps of human flow in environments where the flow is static, i.e., not assumed to change over time. In this paper, we propose an online update method for the CLiFF-map (an advanced map of dynamics type that models motion patterns as velocity and orientation mixtures) to actively detect and adapt to human flow changes. As new observations are collected, our goal is to update a CLiFF-map to effectively and accurately integrate them, while retaining relevant historic motion patterns. The proposed online update method maintains a probabilistic representation in each observed location, updating parameters by continuously tracking sufficient statistics. In experiments using both synthetic and real-world datasets, we show that our method is able to maintain accurate representations of human motion dynamics, contributing to high performance in downstream flow-compliant planning tasks, while being orders of magnitude faster than comparable baselines.
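To illustrate what "continuously tracking sufficient statistics" can look like per map cell, here is a deliberately simplified sketch: a single Gaussian over 2D velocity maintained with a Welford-style online update, whereas CLiFF-maps proper model semi-wrapped mixtures over speed and orientation.

import numpy as np

class CellStats:
    def __init__(self):
        self.n = 0
        self.mean = np.zeros(2)
        self.M2 = np.zeros((2, 2))      # running sum of outer products of deviations

    def update(self, v):                # Welford-style online update with one new velocity
        self.n += 1
        delta = v - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, v - self.mean)

    def covariance(self):
        return self.M2 / max(self.n - 1, 1)

cell = CellStats()
for v in np.random.default_rng(0).normal([1.0, 0.2], 0.1, size=(500, 2)):
    cell.update(v)
print(cell.mean, cell.covariance())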
|
|
16:45-16:50, Paper WeET14.3 | |
A Hybrid Approach to Indoor Social Navigation: Integrating Reactive Local Planning and Proactive Global Planning |
|
Debnath, Arnab | George Mason University |
Stein, Gregory | George Mason University |
Kosecka, Jana | George Mason University |
Keywords: Human-Aware Motion Planning, Collision Avoidance
Abstract: We consider the problem of indoor building-scale social navigation, where the robot must reach a point goal as quickly as possible without colliding with humans who are freely moving around. Factors such as varying crowd densities, unpredictable human behavior, and the constraints of indoor spaces add significant complexity to the navigation task, necessitating a more advanced approach. We propose a modular navigation framework that leverages the strengths of both classical methods and deep reinforcement learning (DRL). Our approach employs a global planner to generate waypoints, assigning soft costs around anticipated pedestrian locations, encouraging caution around potential future positions of humans. Simultaneously, the local planner, powered by DRL, follows these waypoints while avoiding collisions. The combination of these planners enables the agent to perform complex maneuvers and effectively navigate crowded and constrained environments while improving reliability. Many existing studies on social navigation are conducted in simplistic or open environments, limiting the ability of trained models to perform well in complex, real-world settings. To advance research in this area, we introduce a new 2D benchmark designed to facilitate development and testing of social navigation strategies in indoor environments. We benchmark our method against traditional and RL-based navigation strategies, demonstrating that our approach outperforms both.
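A small sketch of the "soft cost around anticipated pedestrian locations" idea (the grid resolution, Gaussian width, and peak cost below are assumed values, not the paper's parameters): Gaussian penalties centred on the predicted human positions are added to the global cost map before waypoints are planned on it.

import numpy as np

def add_pedestrian_costs(costmap, predicted_positions, resolution=0.1,
                         sigma=0.6, peak=50.0):
    """costmap: HxW float array; predicted_positions: list of (x, y) in metres."""
    H, W = costmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    for (px, py) in predicted_positions:
        cj, ci = px / resolution, py / resolution      # predicted position in cell units
        d2 = (xs - cj) ** 2 + (ys - ci) ** 2
        costmap += peak * np.exp(-0.5 * d2 * resolution ** 2 / sigma ** 2)
    return costmap

grid = np.zeros((100, 100))
grid = add_pedestrian_costs(grid, [(4.0, 5.0), (6.5, 2.0)])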
|
|
16:50-16:55, Paper WeET14.4 | |
Overlapping Social Navigation Principles: A Framework for Social Robot Navigation |
|
Ikeda, Bryce | University of North Carolina Chapel Hill |
Higger, Mark | Colorado School of Mines |
Song, Christina Soyoung | Illinois State University |
Trafton, Greg | Naval Research Laboratory |
Keywords: Social HRI, Human-Aware Motion Planning
Abstract: As autonomous robots become integrated into society, they must socially navigate around humans. We propose that effective social robot navigation (SRN) relies on three key principles: social norms, perceived safety, and legibility. Our framework, Overlapping Social Navigation Principles, suggests that the strength of each principle is influenced by the presence of other principles. To test our framework, we implemented SRN behaviors on an autonomous robot in a passing scenario and conducted an online study where participants ranked videos of different SRN behavior combinations. Our findings show that incorporating all three principles enhances SRN, with social norms having the greatest impact.
|
|
16:55-17:00, Paper WeET14.5 | |
Relative Velocity-Based Reward Model for Socially-Aware Navigation with Deep Reinforcement Learning |
|
Maddumage, Vinu Vihan | University of Technology Sydney |
Kodagoda, Sarath | University of Technology, Sydney |
Carmichael, Marc | Centre for Autonomous Systems |
Gunatilake, Amal | University of Technology Sydney |
Thiyagarajan, Karthick | University of Technology Sydney |
Martin, Jodi | Guide Dogs NSW/ACT |
Keywords: Human-Aware Motion Planning, Social HRI, Collision Avoidance
Abstract: Mobile robots are increasingly deployed in shared environments where they must learn to navigate alongside humans. Deep Reinforcement Learning (DRL) techniques have shown promise in developing navigation policies that account for interactions within crowds, fostering socially acceptable movement. However, these techniques often depend heavily on collision avoidance rewards to ensure safe navigation. In this study, we introduce a novel reward component based on relative velocity for collision avoidance, which integrates both the robot’s and humans’ kinematics within personal distance constraints. We conducted a thorough evaluation comparing this new reward model against a conventional one in simulated environments using advanced DRL methods. Our findings indicate that the proposed reward model improves the robots’ ability to avoid collisions and navigate towards their goals while being socially acceptable.
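One plausible form of such a reward component (an illustration consistent with the abstract, not the paper's exact definition) penalises the closing speed between robot and human only inside the personal-distance band.

import numpy as np

def relative_velocity_penalty(p_robot, v_robot, p_human, v_human,
                              personal_dist=1.2, weight=0.5):
    """Negative reward term: active only when the robot is inside the personal distance
    and is moving toward the human (positive closing speed)."""
    d_vec = p_human - p_robot
    dist = np.linalg.norm(d_vec)
    if dist >= personal_dist or dist < 1e-6:
        return 0.0
    closing_speed = np.dot(v_robot - v_human, d_vec / dist)   # > 0 means approaching
    return -weight * max(closing_speed, 0.0) * (personal_dist - dist)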
|
|
17:00-17:05, Paper WeET14.6 | |
SICNav: Safe and Interactive Crowd Navigation Using Model Predictive Control and Bilevel Optimization |
|
Samavi, Sepehr | University of Toronto |
Han, James | University of Toronto |
Shkurti, Florian | University of Toronto |
Schoellig, Angela P. | TU Munich |
Keywords: Social Navigation, Collision Avoidance, Autonomous Vehicle Navigation, Optimization and Optimal Control
Abstract: Robots need to predict and react to human motions to navigate through a crowd without collisions. Many existing methods decouple prediction from planning, which does not account for the interaction between robot and human motions and can lead to the robot getting stuck. We propose SICNav, a Model Predictive Control (MPC) method that jointly solves for robot motion and predicted crowd motion in closed-loop. We model each human in the crowd to be following an Optimal Reciprocal Collision Avoidance (ORCA) scheme and embed that model as a constraint in the robot’s local planner, resulting in a bilevel nonlinear MPC optimization problem. We use a KKT reformulation to cast the bilevel problem as a single level and use a nonlinear solver to optimize. Our MPC method can influence pedestrian motion while explicitly satisfying safety constraints in a single-robot multi-human environment. We analyze the performance of SICNav in two simulation environments and indoor experiments with a real robot to demonstrate safe robot motion that can influence the surrounding humans. We also validate the trajectory forecasting performance of ORCA on a human trajectory dataset.
|
|
WeET15 |
403 |
Surgical Robotics: Systems |
Regular Session |
Chair: Arai, Fumihito | The University of Tokyo |
Co-Chair: Zefran, Milos | University of Illinois at Chicago |
|
16:35-16:40, Paper WeET15.1 | |
Autonomous Continuous Capsulorhexis Based on a Force-Vision-Guided Robot System |
|
Liang, Hongli | Sun Yat-Sen University |
Liu, Jiali | Zhongshan Ophthalmic Center, Sun Yat-Sen University |
Nasseri, M. Ali | Technische Universitaet Muenchen |
Lin, Haotian | Sun Yat-Sen University, Zhongshan Ophthalmic Center |
Huang, Kai | Sun Yat-Sen University |
Keywords: Medical Robots and Systems
Abstract: Capsulorhexis is challenging in cataract surgery, since the size, centering, and circularity of the capsule are important. Those indicators are closely related to the subsequent step of phacoemulsification and the postoperative position of the intraocular lens. It takes a resident 3-5 years of practice, and even then deficient capsulorhexis cannot be entirely avoided. This paper proposes a robotic system to automate Continuous Curvilinear Capsulorhexis (CCC) in cataract surgery. A typical ophthalmic microscope system and a triaxial force sensor are utilized to guide the robot system with a force-vision method. The constraint of a Remote Center of Motion (RCM) is designed to perform the surgery route. The experimental results on ex-vivo porcine eyes show that our autonomous method can achieve a satisfactory 6mm capsule. With an average centering deviation below 76% and circularity of 0.993, the consistency of the capsulorhexis is comparable to a surgeon-made one.
|
|
16:40-16:45, Paper WeET15.2 | |
Ultrasound-Guided Robotic Blood Drawing and in Vivo Studies on Submillimetre Vessels of Rats |
|
Jing, Shuaiqi | Chengdu Aixam Medical Technology Co., Ltd |
Yao, Tianliang | Tongji University |
Zhang, Ke | Chengdu Aixam Medical Technology Co. Ltd |
Wu, Di | University of Southern Denmark |
Wang, Qiulin | Chengdu Aixam Medical Technology Co., Ltd |
Chen, Zixi | Scuola Superiore Sant'Anna |
Chen, Ke | Chengdu Aixam Medical Technology Co., Ltd |
Qi, Peng | Tongji University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Service Robotics
Abstract: Billions of vascular access procedures are performed annually worldwide, serving as a crucial first step in various clinical diagnostic and therapeutic procedures. For pediatric or elderly individuals, whose vessels are small in size (typically 2 to 3 mm in diameter for adults and <1 mm in children), vascular access can be highly challenging. This study presents an image-guided robotic system aimed at enhancing the accuracy of difficult vascular access procedures. The system integrates a 6-DoF (Degrees of Freedom) robotic arm with a 3-DoF end-effector, ensuring precise navigation and needle insertion. Multi-modal imaging and sensing technologies have been utilized to endow the medical robot with precision and safety, while ultrasound (US) imaging guidance is specifically evaluated in this study. To evaluate in vivo vascular access in submillimeter vessels, we conducted ultrasound-guided robotic blood drawing on the tail veins (with a diameter of 0.7 ± 0.2 mm) of 40 rats. The results demonstrate that the system achieved a first-attempt success rate of 95%. The high first-attempt success rate in intravenous vascular access, even with small blood vessels, demonstrates the system’s effectiveness in performing these procedures. This capability reduces the risk of failed attempts, minimizes patient discomfort, and enhances clinical efficiency.
|
|
16:45-16:50, Paper WeET15.3 | |
Sensory Glove-Based Surgical Robot User Interface |
|
Borgioli, Leonardo | University of Illinois Chicago |
Oh, Ki-Hwan | University of Illinois at Chicago |
Valle, Valentina | Surgical Innovation and Training Lab, Department of Surgery, Col |
Ducas, Alvaro | Surgical Innovation and Training Lab, Department of Surgery, Col |
Halloum, Mohammad | Surgical Innovation and Training Lab, Department of Surgery, Col |
Mendoza Medina, Diego Federico | Surgical Innovation and Training Lab, Department of Surgery, Col |
Lopez, Paula | Surgical Innovation and Training Lab, Department of Surgery, Col |
Sharifi, Arman | Surgical Innovation and Training Lab, Department of Surgery, Col |
Cassiani, Jessica | Surgical Innovation and Training Lab, Department of Surgery, Col |
Zefran, Milos | University of Illinois at Chicago |
Chen, Liaohai | Surgical Innovation and Training Lab, Department of Surgery, Col |
Giulianotti, Pier Cristoforo | Surgical Innovation and Training Lab, Department of Surgery, Col |
Keywords: Surgical Robotics: Laparoscopy, Medical Robots and Systems, Telerobotics and Teleoperation
Abstract: Robotic surgery has reached a high level of maturity and has become an integral part of standard surgical care. However, existing surgeon consoles are bulky and take up valuable space in the operating room, present challenges for surgical team coordination, and their proprietary nature makes it difficult to take advantage of recent technological advances, especially in virtual and augmented reality. One potential area for further improvement is the integration of modern sensory gloves into robotic platforms, allowing surgeons to control robotic arms intuitively with their hand movements. We propose one such system that combines an HTC Vive tracker, a Manus Meta Prime 3 XR sensory glove, and SCOPEYE wireless smart glasses. The system controls one arm of a da Vinci surgical robot. In addition to moving the arm, the surgeon can use fingers to control the end-effector of the surgical instrument. Hand gestures are used to implement clutching and similar functions. In particular, we introduce clutching of the instrument orientation, a functionality unavailable in the da Vinci system. The vibrotactile elements of the glove are used to provide feedback to the user when gesture commands are invoked. A qualitative and quantitative evaluation has been conducted comparing the current device to the dVRK console; the system shows that it has excellent tracking accuracy and allows surgeons to efficiently perform common surgical training tasks with minimal practice with the new interface.
|
|
16:50-16:55, Paper WeET15.4 | |
Self-Deformable Magnetic Miniature Robot for Traction Assistance in Endoscopic Submucosal Dissection |
|
Zhang, Bolan | The University of Tokyo |
Yamanaka, Toshiro | The University of Tokyo |
Shu, Tengo | The University of Tokyo |
Liu, Yuxuan | The University of Tokyo |
Arai, Fumihito | The University of Tokyo |
Keywords: Medical Robots and Systems, Soft Robot Applications
Abstract: Between 1999 and 2020, gastrointestinal cancers were responsible for over three million deaths, emphasizing the critical role of minimally invasive surgical techniques like Endoscopic Submucosal Dissection (ESD) in managing such life-threatening conditions. ESD, which dissects the connective tissue between the mucosal and muscular layers using an electrosurgical knife connected to an endoscope, requires a constant traction force to stabilize tissues and expose underlying anatomical structures. This paper introduces a miniature magnetic flexible robot, actuated by a permanent magnet on a robotic manipulator, designed to enhance ESD by providing traction forces consistently on lesions. The robot was fabricated by casting magnetic silicone composites, and its safe deployment through the endoscope instrument channel was successfully demonstrated, avoiding tissue contact. Experiments in a rubber intestine model validated the feasibility of providing constant traction and 2 DOF orientation control via the robot, allowing real-time fine-tuning of the force direction. This reduces the difficulty and improves the precision and safety of ESD. This research presents a practical method for achieving stable force output in medical miniature robots, particularly in gastrointestinal procedures.
|
|
16:55-17:00, Paper WeET15.5 | |
Variable-Stiffness Nasotracheal Intubation Robot with Passive Buffering: A Modular Platform in Mannequin Studies |
|
Hao, Ruoyi | The Chinese University of Hong Kong |
Lai, Jiewen | The Chinese University of Hong Kong |
Zhong, Wenqi | The Chinese University of Hong Kong |
Xie, Dihong | The Chinese University of Hong Kong |
Tian, Yu | The Chinese University of Hong Kong |
Zhang, Tao | Chinese University of Hong Kong |
Zhang, Yang | Hubei University of Technology |
Chan, Catherine Po Ling | The Chinese University of Hong Kong |
Chan, Jason Ying-Kuen | The Chinese University of Hong Kong |
Ren, Hongliang | Chinese Univ Hong Kong (CUHK) & National Univ Singapore(NUS) |
Keywords: Medical Robots and Systems, Telerobotics and Teleoperation, Mechanism Design
Abstract: Intubation is a critical medical procedure for securing airway patency in patients, but the inconsistent skill levels among medical practitioners necessitate the advancement of better robotic solutions. While orotracheal intubation robots have been widely developed, nasotracheal intubation remains essential in specific clinical scenarios. However, nasotracheal intubation robots are still underdeveloped and lack buffer protection mechanisms to ensure safety. This study presents a novel variable-stiffness nasotracheal intubation robot (NIR) with passive buffering. The proposed NIR is a modular platform capable of performing the main steps of nasotracheal intubation, validated through mannequin studies via teleoperation. We proposed a variable-stiffness fiberoptic bronchoscope (FOB) control module for the FOB distal end control, and validated its dual functionality in experiments: low-stiffness mode provides passive buffering during nasal cavity navigation, with a frontal peak force of 2.8 N and a lateral peak force of 0.12 N; high-stiffness mode enhances load-bearing capacity for near-glottis navigation, with a frontal bearing force of 4.9 N and a lateral bearing force of 0.42 N. Additionally, a compact (74 × 64 × 53 mm, 150 g) FOB feeding module with passive failure protection was designed to limit the max frontal impact force to 2.3 N.
|
|
17:00-17:05, Paper WeET15.6 | |
SurgPose: A Dataset for Articulated Robotic Surgical Tool Pose Estimation and Tracking |
|
Wu, Zijian | The University of British Columbia |
Schmidt, Adam | Intuitive Surgical |
Moore, Randy | The University of British Columbia |
Zhou, Haoying | Worcester Polytechnic Institute |
Banks, Alexandre | University of New Brunswick |
Kazanzides, Peter | Johns Hopkins University |
Salcudean, Septimiu E. | University of British Columbia |
Keywords: Data Sets for Robotic Vision, Surgical Robotics: Laparoscopy, Computer Vision for Medical Robotics
Abstract: Accurate and efficient surgical robotic tool pose estimation is of fundamental significance to downstream applications such as augmented reality (AR) in surgical training and learning-based autonomous manipulation. While significant advancements have been made in pose estimation for humans and animals, it is still a challenge in surgical robotics due to the scarcity of published data. The relatively large absolute error of the da Vinci end effector kinematics and arduous calibration procedure make calibrated kinematics data collection expensive. Driven by this limitation, we collected a dataset, dubbed SurgPose, providing instance-aware semantic keypoints for visual surgical tool pose estimation and tracking. By marking keypoints using ultraviolet (UV) reactive paint, which is invisible under white light and fluorescent under UV light, we execute the same trajectory under different lighting conditions to collect raw videos and keypoint annotations, respectively. The SurgPose dataset consists of approximately 120K surgical instrument instances of 6 categories as shown in Fig. 1. Since the videos are collected in stereo pairs, the 2D pose can be lifted to 3D based on stereo-matching depth. In addition to releasing the dataset, we tested a few baseline approaches to surgical instrument tracking to demonstrate the utility of SurgPose. More details can be found at surgpose.github.io.
|
|
17:05-17:10, Paper WeET15.7 | |
On High Performance Control of Concentric Tube Continuum Robots through Parsimonious Calibration |
|
Boyer, Quentin | UBFC |
Voros, Sandrine | TIMC-IMAG Laboratory |
Roux, Pierre | FEMTO-ST Institute |
Marionnet, François | FEMTO-ST Institute |
Rabenorosoa, Kanty | Univ. Bourgogne Franche-Comté, CNRS |
Chikhaoui, M. Taha | CNRS - Univ. Grenoble Alpes |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Calibration and Identification
Abstract: Continuum robots deform continuously, compared to conventional robots composed of rigid links and joints, and require dedicated calibration methods. Indeed, calibration is an essential step to obtain high performance control, as it directly influences robot accuracy. In this paper, we investigate how model parameters influence both model accuracy and model-based closed-loop control accuracy of Concentric Tube Continuum Robots (CTCR). A fast, robust, and real-time implementation of the Cosserat rod model is first introduced. Then, a model-based Jacobian control scheme is presented. A parsimonious calibration procedure focused on control accuracy is finally proposed to achieve submillimetric tracking errors along a 3D trajectory at velocity reaching 5 mm/s in complex scenarios including actuation constraints, obstacle avoidance, and external forces. Results are demonstrated both in simulation and on an experimental setup of a 3-tube CTCR.
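To make the Jacobian-based control scheme above concrete, the following Python sketch (not the authors' implementation) shows a damped least-squares resolved-rate update toward a desired tip position; tip_position and jacobian are illustrative placeholders standing in for a Cosserat-rod-based kinematic model and its linearization.

    import numpy as np

    def tip_position(q):
        # Placeholder forward kinematics: maps actuation values to a 3D tip position.
        # A real controller would evaluate the calibrated Cosserat rod model here.
        return np.array([np.sin(q[0]), np.sin(q[1]), q[2]])

    def jacobian(q, eps=1e-6):
        # Finite-difference Jacobian of the tip position with respect to the actuation values.
        J = np.zeros((3, len(q)))
        f0 = tip_position(q)
        for i in range(len(q)):
            dq = np.zeros_like(q)
            dq[i] = eps
            J[:, i] = (tip_position(q + dq) - f0) / eps
        return J

    def control_step(q, x_des, gain=2.0, damping=1e-3, dt=0.01):
        # One damped least-squares (resolved-rate) update toward the desired tip position.
        e = x_des - tip_position(q)                   # task-space tracking error
        J = jacobian(q)
        JJt = J @ J.T + damping * np.eye(3)           # damping guards against singular configurations
        q_dot = J.T @ np.linalg.solve(JJt, gain * e)  # damped pseudo-inverse velocity command
        return q + dt * q_dot

    q = np.array([0.1, 0.2, 0.05])
    for _ in range(500):
        q = control_step(q, x_des=np.array([0.3, 0.4, 0.1]))
    print(np.round(tip_position(q), 3))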
|
|
WeET16 |
404 |
Deformable Objects |
Regular Session |
Chair: Li, Yunzhu | Columbia University |
Co-Chair: Iordachita, Ioan Iulian | Johns Hopkins University |
|
16:35-16:40, Paper WeET16.1 | |
Deformation Control of a 3D Soft Object Using RGB-D Visual Servoing and FEM-Based Dynamic Model |
|
Ouafo Fonkoua, Mandela | Inria Centre at Rennes University |
Chaumette, Francois | Inria Center at University of Rennes |
Krupa, Alexandre | Centre Inria De l'Université De Rennes |
Keywords: Visual Servoing, Dexterous Manipulation
Abstract: In this letter, we present a visual control framework for accurately positioning feature points belonging to the surface of a 3D deformable object to desired 3D positions, by acting on a set of manipulated points using a robotic manipulator. Notably, our framework considers the dynamic behavior of the object deformation, that is, we do not assume that the object is in static equilibrium during the manipulation. By relying on a coarse dynamic Finite Element Model (FEM), we have successfully formulated the analytical relationship relating the motion of the feature points to the six degrees of freedom (6-DOF) motion of a robot gripper. From this modeling step, a novel closed-loop deformation controller is designed. To be robust against model approximations, the whole shape of the object is tracked in real time using an RGB-D camera, thus allowing any drift between the object and its model to be corrected on the fly. Our model-based and vision-based controller has been validated in real experiments. The results highlight the effectiveness of the proposed methodology.
|
|
16:40-16:45, Paper WeET16.2 | |
Real-Time Deformation-Aware Control for Autonomous Robotic Subretinal Injection Based on OCT Guidance |
|
Arikan, Demir | Technical University Munich |
Zhang, Peiyao | Johns Hopkins University |
Sommersperger, Michael | Technical University of Munich |
Dehghani, Shervin | TUM |
Esfandiari, Mojtaba | Johns Hopkins University |
Taylor, Russell H. | The Johns Hopkins University |
Nasseri, M. Ali | Technische Universitaet Muenchen |
Gehlbach, Peter | Johns Hopkins Medical Institute |
Navab, Nassir | TU Munich |
Iordachita, Ioan Iulian | Johns Hopkins University |
Keywords: Vision-Based Navigation, Medical Robots and Systems, Computer Vision for Medical Robotics
Abstract: Robotic platforms provide consistent and precise tool positioning that significantly enhances retinal microsurgery. Integrating such systems with intraoperative optical coherence tomography (iOCT) enables image-guided robotic interventions, allowing autonomous performance of advanced treatments, such as injecting therapeutic agents into the subretinal space. However, tissue deformations due to tool-tissue interactions constitute a significant challenge in autonomous iOCT-guided robotic subretinal injections. Such interactions impact correct needle positioning and procedure outcomes. This paper presents a novel method for autonomous subretinal injection under iOCT guidance that considers tissue deformations during the insertion procedure. The technique is achieved through real-time segmentation and 3D reconstruction of the surgical scene from densely sampled iOCT B-scans, which we refer to as B5-scans. Using B5-scans, we monitor the position of the instrument relative to a virtual target layer between the ILM and RPE. Our experiments on ex-vivo porcine eyes demonstrate dynamic adjustment of the insertion depth and overall improved accuracy in needle positioning compared to prior autonomous insertion approaches. Compared to a 35% success rate in subretinal bleb generation with previous approaches, our method reliably created subretinal blebs in 90% of our experiments. The source code and data used in this study are publicly available on GitHub.
|
|
16:45-16:50, Paper WeET16.3 | |
6-DoF Shape Servoing of Deformable Objects in Co-Rotated Space of Modal Graph |
|
Yang, Bohan | The Chinese University of Hong Kong |
Huang, Tianyu | The Chinese University of Hong Kong |
Zhong, Fangxun | The Chinese University of Hong Kong, Shenzhen |
Liu, Yunhui | Chinese University of Hong Kong |
Keywords: Visual Servoing, Dexterous Manipulation, Robust/Adaptive Control
Abstract: Shape control of deformable objects under both rotational and translational deformations is important for versatile robotic applications. However, deformation control with full 6-degree-of-freedom (DoF) manipulation is an open problem, since modeling and describing rotational deformations lead to significant challenges. To tackle the problem, this paper proposes a novel method by introducing a co-rotated space for the modal graph representation of objects with unknown physical and geometric models. In this space, we design new deformation features that can encode local rotations while preserving a compact and low-frequency shape representation. Moreover, these features can be mapped analytically to the robot manipulation, enabling the design of adaptive control laws with guaranteed stability for unmodeled objects. Experiments on complex volumetric objects demonstrate the effectiveness and advantages of our method with raw, noisy, and unregistered point clouds. The results highlight the importance of integrating the co-rotated features to address rotational deformations.
|
|
16:50-16:55, Paper WeET16.4 | |
Deformable Gaussian Splatting for Efficient and High-Fidelity Reconstruction of Surgical Scenes |
|
Shan, Jiwei | The Chinese University of Hong Kong |
Cai, Zeyu | Shanghai Jiao Tong University |
Hsieh, Cheng-Tai | Shanghai Jiao Tong University |
Han, Lijun | Shanghai Jiao Tong University |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Wang, Hesheng | Shanghai Jiao Tong University |
Keywords: Computer Vision for Medical Robotics, Surgical Robotics: Laparoscopy
Abstract: Efficient and high-fidelity reconstruction of deformable surgical scenes is a critical yet challenging task. Building on recent advancements in 3D Gaussian splatting, current methods have seen significant improvements in both reconstruction quality and rendering speed. However, two major limitations remain: (1) difficulty in handling irreversible dynamic changes, such as tissue shearing, which are common in surgical scenes; and (2) the lack of hierarchical modeling for surgical scene deformation, which reduces rendering speed. To address these challenges, we introduce EH-SurGS, an efficient and high-fidelity reconstruction algorithm for deformable surgical scenes. We propose a deformation modeling approach that incorporates the life cycle of 3D Gaussians, effectively capturing both regular and irreversible deformations, thus enhancing reconstruction quality. Additionally, we present an adaptive motion hierarchy strategy that distinguishes between static and deformable regions within the surgical scene. This strategy reduces the number of 3D Gaussians passing through the deformation field, thereby improving rendering speed. Extensive experiments on public datasets captured with static endoscopes demonstrate that our method surpasses existing state-of-the-art approaches in both reconstruction quality and rendering speed. Ablation studies further validate the effectiveness and necessity of our proposed components. We will open-source our code upon acceptance of the paper.
|
|
16:55-17:00, Paper WeET16.5 | |
One-Shot Video Imitation Via Parameterized Symbolic Abstraction Graphs |
|
Wang, Jianren | Carnegie Mellon University |
Liu, Kangni | Carnegie Mellon University |
Guo, Dingkun | Carnegie Mellon University |
Xian, Zhou | Carnegie Mellon University |
Atkeson, Christopher | CMU |
Keywords: Learning from Demonstration, Simulation and Animation
Abstract: Learning to manipulate dynamic and deformable objects from a single demonstration video holds great promise in terms of scalability. Previous approaches have predominantly focused on either replaying object relationships or actor trajectories. The former often struggles to generalize across diverse tasks, while the latter suffers from data inefficiency. Moreover, both methodologies encounter challenges in capturing invisible physical attributes, such as forces. In this paper, we propose to interpret video demonstrations through a series of Parameterized Symbolic Abstraction Graphs (PSAGs), where nodes represent objects and edges denote relationships between objects. We further ground geometric constraints through simulation to estimate non-geometric, visually imperceptible attributes. The augmented PSAGs are then applied in real robot experiments. Our approach has been validated across a range of tasks, such as Cutting Avocado, Cutting Vegetable, Pouring Liquid, Rolling Dough, and Slicing Pizza. We demonstrate successful generalization to novel objects with distinct visual and physical properties. For visualizations of the learned policies please check: https://jianrenw.com/PSAG/
|
|
17:00-17:05, Paper WeET16.6 | |
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation |
|
Liu, Zixian | Tsinghua University |
Zhang, Mingtong | UIUC |
Li, Yunzhu | Columbia University |
Keywords: Machine Learning for Robot Control, Model Learning for Control, Deep Learning in Grasping and Manipulation
Abstract: With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io/.
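As a rough illustration of how a keypoint-based target specification can become a cost for model-based planning (a simplified sketch, not KUDA's actual dynamics model or optimizer), the snippet below rolls out candidate action sequences through a toy dynamics function and keeps the one minimizing the keypoint-to-target cost; dynamics, plan, and the action parameterization are all illustrative assumptions.

    import numpy as np

    def dynamics(keypoints, action):
        # Stand-in for a learned neural dynamics model: here the action simply
        # translates all keypoints. The real model would predict object motion.
        return keypoints + action

    def keypoint_cost(keypoints, targets):
        # Cost from a keypoint-based target specification: sum of squared
        # distances between current keypoints and their assigned targets.
        return np.sum((keypoints - targets) ** 2)

    def plan(keypoints, targets, horizon=5, samples=256, rng=np.random.default_rng(0)):
        # Random-shooting planner: sample action sequences, roll them out with the
        # dynamics model, and keep the sequence minimizing the keypoint cost.
        best_cost, best_seq = np.inf, None
        for _ in range(samples):
            seq = rng.uniform(-0.05, 0.05, size=(horizon, keypoints.shape[1]))
            kp = keypoints.copy()
            for a in seq:
                kp = dynamics(kp, a)
            c = keypoint_cost(kp, targets)
            if c < best_cost:
                best_cost, best_seq = c, seq
        return best_seq, best_cost

    kp0 = np.zeros((4, 2))                      # four keypoints on the object
    tgt = np.full((4, 2), 0.2)                  # targets derived from the VLM's specification
    actions, cost = plan(kp0, tgt)
    print(actions[0], round(cost, 4))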
|
|
17:05-17:10, Paper WeET16.7 | |
DLO Perceiver: Grounding Large Language Model for Deformable Linear Objects Perception |
|
Caporali, Alessio | University of Bologna |
Galassi, Kevin | Università Di Bologna |
Palli, Gianluca | University of Bologna |
Keywords: Computer Vision for Manufacturing, Deep Learning for Visual Perception, Recognition
Abstract: The perception of Deformable Linear Objects (DLOs) is a challenging task due to their complex and ambiguous appearance, lack of discernible features, typically small sizes, and deformability. Despite these challenges, achieving a robust and effective segmentation of DLOs is crucial to introduce robots into environments where they are currently underrepresented, such as domestic and complex industrial settings. In this context, the integration of language-based inputs can simplify the perception task while also enabling the possibility of introducing robots as human companions. Therefore, this paper proposes a novel architecture for the perception of DLOs, wherein the input image is augmented with a text-based prompt guiding the segmentation of the target DLO. After encoding the image and text separately, a Perceiver-inspired structure is exploited to compress the concatenated data into transformer layers and generate the output mask from a latent vector representation. The method is experimentally evaluated on real-world images of DLOs like electrical cables and ropes, validating its efficacy and efficiency in real practical scenarios.
|
|
WeET17 |
405 |
Large Models for Autonomous Vehicles |
Regular Session |
Chair: Billah, Syed | Pennsylvania State University |
|
16:35-16:40, Paper WeET17.1 | |
Label Anything: An Interpretable, High-Fidelity and Prompt-Free Annotator |
|
Kou, Wei-Bin | The University of Hong Kong |
Zhu, Guangxu | Shenzhen Research Institute of Big Data |
Ye, Rongguang | Southern University of Science and Technology |
Wang, Shuai | Shenzhen Institute of Advanced Technology, Chinese Academy of Sc |
Tang, Ming | Southern University of Science and Technology |
Wu, Yik-Chung | The University of Hong Kong |
Keywords: Semantic Scene Understanding, Object Detection, Segmentation and Categorization, Intelligent Transportation Systems
Abstract: Learning-based street scene semantic understanding in autonomous driving (AD) has advanced significantly recently, but the performance of the AD model is heavily dependent on the quantity and quality of the annotated training data. However, traditional manual labeling involves a high cost to annotate the vast amount of data required to train a robust model. To mitigate this cost of manual labeling, we propose a Label Anything Model (denoted as LAM), serving as an interpretable, high-fidelity, and prompt-free data annotator. Specifically, we first incorporate a pretrained Vision Transformer (ViT) to extract the latent features. On top of the ViT, we propose a semantic class adapter (SCA) and an optimization-oriented unrolling algorithm (OptOU), both with a small number of trainable parameters. SCA fuses the ViT-extracted features to form the basis of the subsequent automatic annotation. OptOU consists of multiple cascading layers, each containing an optimization formulation that aligns its output with the ground truth as closely as possible, through which OptOU remains interpretable rather than a learning-based black box. In addition, training SCA and OptOU requires only a single pre-annotated RGB seed image, owing to their small number of learnable parameters. Extensive experiments clearly demonstrate that the proposed LAM can generate high-fidelity annotations (almost 100% in mIoU) for multiple real-world datasets (i.e., Camvid, Cityscapes, and Apolloscapes) and the CARLA simulation dataset.
|
|
16:40-16:45, Paper WeET17.2 | |
Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding |
|
Kabir, Imran | Pennsylvania State University |
Reza, Md Alimoor | Drake University |
Billah, Syed | Pennsylvania State University |
Keywords: Semantic Scene Understanding, Multi-Modal Perception for HRI, Formal Methods in Robotics and Automation
Abstract: Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 50% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at: https://github.com/Imran2205/LogicRAG.
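For intuition, a minimal sketch of the kind of first-order-logic fact base Logic-RAG maintains is given below; the predicates, object names, and the single inference rule are hypothetical examples, not the framework's actual knowledge base or inference engine.

    # Object-object relations stored as first-order-logic style facts (illustrative names).
    facts = {
        ("in_front_of", "pedestrian_1", "ego_vehicle"),
        ("left_of", "cyclist_2", "ego_vehicle"),
        ("moving_toward", "pedestrian_1", "crosswalk_3"),
    }

    def holds(predicate, a, b):
        # Check whether a relation is currently asserted in the knowledge base.
        return (predicate, a, b) in facts

    def infer_symmetric_pairs():
        # One simple inference rule: left_of(X, Y) implies right_of(Y, X).
        derived = set()
        for (p, x, y) in facts:
            if p == "left_of":
                derived.add(("right_of", y, x))
        return derived

    facts |= infer_symmetric_pairs()
    print(holds("right_of", "ego_vehicle", "cyclist_2"))    # True
    print(holds("in_front_of", "cyclist_2", "ego_vehicle")) # False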
|
|
16:45-16:50, Paper WeET17.3 | |
Discrete Contrastive Learning for Diffusion Policies in Autonomous Driving |
|
Kujanpää, Kalle | Aalto University |
Baimukashev, Daulet | Aalto University |
Munir, Farzeen | Aalto University, Finnish Center for Artificial Intelligence |
Azam, Shoaib | Aalto University, Finnish Center for Artificial Intelligence (FC |
Kucner, Tomasz Piotr | Aalto University |
Pajarinen, Joni | Aalto University |
Kyrki, Ville | Aalto University |
Keywords: Intelligent Transportation Systems, Modeling and Simulating Humans, Learning from Demonstration
Abstract: Learning to perform accurate and rich simulations of human driving behaviors from data for autonomous vehicle testing remains challenging due to human driving styles' high diversity and variance. We address this challenge by proposing a novel approach that leverages contrastive learning to extract a dictionary of driving styles from pre-existing human driving data. We discretize these styles with quantization, and the styles are used to learn a conditional diffusion policy for simulating human drivers. Our empirical evaluation confirms that the behaviors generated by our approach are both safer and more human-like than those of the machine-learning-based baseline methods. We believe this has the potential to enable higher realism and more effective techniques for evaluating and improving the performance of autonomous vehicles.
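A minimal sketch of the quantization step described above, assuming style embeddings come from a contrastive encoder: each embedding is snapped to its nearest codebook entry, and the resulting discrete index can condition the diffusion policy. The codebook size, dimensionality, and names are illustrative.

    import numpy as np

    def quantize(style_embedding, codebook):
        # Assign a continuous driving-style embedding to its nearest codebook entry,
        # yielding a discrete style index (simple vector quantization).
        distances = np.linalg.norm(codebook - style_embedding, axis=1)
        idx = int(np.argmin(distances))
        return idx, codebook[idx]

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(8, 16))        # 8 discrete driving styles, 16-d embeddings
    embedding = rng.normal(size=16)            # embedding of one observed driver
    style_id, style_vector = quantize(embedding, codebook)
    print(style_id)                            # discrete style used to condition the policy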
|
|
16:50-16:55, Paper WeET17.4 | |
Intelligence Evaluation Methods for Autonomous Vehicles |
|
Zhou, Junjie | Shanghai Jiao Tong University |
Wang, Lin | Shanghai Jiao Tong University |
Meng, Qiang | National University of Singapore |
Wang, Xiaofan | Shanghai University |
Keywords: Intelligent Transportation Systems, Performance Evaluation and Benchmarking, Autonomous Agents
Abstract: The rapid advancement of artificial intelligence has significantly enhanced the intelligence of autonomous vehicles (AVs). However, owing to the complexity of AV behavior and the high dimensionality of driving environments, the objective and practical quantitative evaluation of AV intelligence remains a significant and unresolved challenge. This paper proposes a robust training-based comprehensive evaluation (RTCE) system specifically designed to assess the intelligence of AVs in the time dimension. Beginning with a foundation model, the first generation of AVs is developed by training in the initial naturalistic traffic scenarios. To effectively test the intelligence of the AVs, we propose an adversarial trajectory optimization technique to generate challenging, critical test scenarios that evaluate the learning capabilities of AVs in complex environments. Through robust training in these complex scenarios, the second generation of AVs is obtained. To objectively and effectively quantify the intelligence of AVs, we further propose a comprehensive evaluation metric system encompassing five dimensions and 14 evaluation metrics. The intelligence score of each AV is computed using the objective multi-criteria decision-making approach. The proposed intelligence evaluation method is validated using various self-evolution autonomous driving algorithms. The results demonstrate that the RTCE method can quantitatively and effectively test the intelligence of AVs in a multi-dimensional and automated manner. Furthermore, the proposed method is flexible and generalizable, making it adaptable to different testing platforms and autonomous driving algorithms.
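The abstract does not specify the exact multi-criteria decision-making scheme, so the sketch below only illustrates the general idea of aggregating normalized metrics into a single intelligence score with a weighted sum; the metric names, weights, and normalization are assumptions.

    import numpy as np

    def intelligence_score(metrics, weights, lower_is_better=None):
        # metrics: (n_vehicles, n_metrics) raw values; weights: per-metric importance.
        # Min-max normalize each metric, flip metrics where lower is better,
        # then aggregate with a weighted sum (one simple multi-criteria scheme).
        metrics = np.asarray(metrics, dtype=float)
        lo, hi = metrics.min(axis=0), metrics.max(axis=0)
        norm = (metrics - lo) / np.where(hi > lo, hi - lo, 1.0)
        if lower_is_better is not None:
            norm[:, lower_is_better] = 1.0 - norm[:, lower_is_better]
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        return norm @ weights

    raw = [[0.92, 1.4, 0.85],     # AV 1: success rate, incidents per 100 km, comfort (hypothetical)
           [0.88, 0.6, 0.90]]     # AV 2
    scores = intelligence_score(raw, weights=[0.5, 0.3, 0.2], lower_is_better=[1])
    print(np.round(scores, 3))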
|
|
16:55-17:00, Paper WeET17.5 | |
NaVid-4D: Unleashing Spatial Intelligence in Egocentric RGB-D Videos for Vision-And-Language Navigation |
|
Liu, Haoran | Peking University |
Wan, Weikang | Peking University |
Yu, Xiqian | University of Science and Technology of China |
Li, Minghan | Galbot |
Zhang, Jiazhao | Peking University |
Zhao, Bo | Shanghai Jiao Tong University |
Chen, Zhibo | University of Science and Technology of China |
Wang, Zhongyuan | BAAI |
Zhang, Zhizheng | University of Science and Technology of China |
Wang, He | Peking University |
Keywords: AI-Based Methods, Autonomous Agents, Vision-Based Navigation
Abstract: Understanding and reasoning about 4D space-time is crucial for Vision-and-Language Navigation (VLN). However, previous works lack in-depth exploration in this aspect, resulting in bottlenecked spatial perception and action precision of VLN agents. In this work, we introduce NaVid-4D, a Vision Language Model (VLM) based navigation agent taking the lead in explicitly showcasing the capabilities of spatial intelligence in the real world. Given natural language instructions, NaVid-4D requires only egocentric RGB-D video streams as observations to perform spatial understanding and reasoning for generating precise instruction-following robotic actions. NaVid-4D learns navigation policies using data from simulation environments and is endowed with precise spatial understanding and reasoning capabilities using web data. Without the need to pre-train an RGB-D foundation model, we propose a method capable of directly injecting the depth features into the visual encoder of a VLM. We further compare the use of actually captured depth information with monocularly estimated depth and find that NaVid-4D works well with both, while using estimated depth offers greater generalization capability and better mitigates the sim-to-real gap. Extensive experiments demonstrate that NaVid-4D achieves state-of-the-art performance in simulation environments and delivers impressive VLN performance with spatial intelligence in the real world.
|
|
17:00-17:05, Paper WeET17.6 | |
Generating Out-Of-Distribution Scenarios Using Language Models |
|
Aasi, Erfan | Massachusetts Institute of Technology |
Nguyen, Phat | University of Massachusetts Amherst |
Sreeram, Shiva | MIT |
Rosman, Guy | Massachusetts Institute of Technology |
Karaman, Sertac | Massachusetts Institute of Technology |
Rus, Daniela | MIT |
Keywords: AI-Based Methods
Abstract: The deployment of autonomous vehicles controlled by machine learning techniques requires extensive testing in diverse real-world environments, robust handling of edge cases and out-of-distribution scenarios, and comprehensive safety validation to ensure that these systems can navigate safely and effectively under unpredictable conditions. Addressing Out-Of-Distribution (OOD) driving scenarios is essential for enhancing safety, as OOD scenarios help validate the reliability of the models within the vehicle’s autonomy stack. However, generating OOD scenarios is challenging due to their long-tailed distribution and rarity in urban driving datasets. Recently, Large Language Models (LLMs) have shown promise in autonomous driving, particularly for their zero-shot generalization and common-sense reasoning capabilities. In this paper, we leverage these LLM strengths to introduce a framework for generating diverse OOD driving scenarios. Our approach uses LLMs to construct a branching tree, where each branch represents a unique OOD scenario. These scenarios are then simulated in the CARLA simulator using an automated framework that aligns scene augmentation with the corresponding textual descriptions. We evaluate our framework through extensive simulations, and assess its performance via a diversity metric that measures the richness of the scenarios. Additionally, we introduce a new "OOD-ness" metric, which quantifies how much the generated scenarios deviate from typical urban driving conditions. Furthermore, we explore the capacity of modern Vision-Language Models (VLMs) to interpret and safely navigate through the simulated OOD scenarios. Our findings offer valuable insights into the reliability of language models in addressing OOD scenarios within the context of urban driving.
|
|
17:05-17:10, Paper WeET17.7 | |
MAGIC-VFM - Meta-Learning Adaptation for Ground Interaction Control with Visual Foundation Models |
|
Lupu, Elena-Sorina | California Institute of Technology |
Xie, Fengze | California Institute of Technology |
Preiss, James | Caltech |
Alindogan, Jedidiah | California Institute of Technology |
Anderson, Matthew | Caltech |
Chung, Soon-Jo | Caltech |
Keywords: Model Learning for Control, Learning and Adaptive Systems, Field Robots, Visual Foundation Models
Abstract: Control of off-road vehicles is challenging due to the complex dynamic interactions with the terrain. Accurate modeling of these interactions is important to optimize driving performance, but the relevant physical phenomena are too complex to model from first principles. Therefore, we present an offline meta-learning algorithm to construct a rapidly-tunable model of residual dynamics and disturbances. Our model processes terrain images into features using a visual foundation model (VFM), then maps these features and the vehicle state to an estimate of the current actuation matrix using a deep neural network (DNN). We then combine this model with composite adaptive control to modify the last layer of the DNN in real time, accounting for the remaining terrain interactions not captured during offline training. We provide mathematical guarantees of stability and robustness for our controller, and demonstrate the effectiveness of our method through simulations and hardware experiments with a tracked vehicle and a car-like robot. We evaluate our method outdoors on different slopes with varying slippage and actuator degradation disturbances, and compare against an adaptive controller that
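A hedged sketch of the composite-adaptation idea (adjusting only the last layer of the learned residual-dynamics model online) is shown below; the feature dimension, gains, and the absence of measurement noise are illustrative simplifications, not the paper's actual update law.

    import numpy as np

    def adapt_last_layer(W, phi, residual, tracking_error, gamma=0.5, dt=0.01):
        # Composite-style adaptation sketch: update the last-layer weights W using
        # both the prediction residual (observed minus predicted disturbance) and
        # the tracking error, as in composite adaptive control. Gains are illustrative.
        prediction_term = np.outer(residual, phi)        # drives the prediction error down
        tracking_term = np.outer(tracking_error, phi)    # drives the tracking error down
        return W + dt * gamma * (prediction_term + tracking_term)

    rng = np.random.default_rng(1)
    W = np.zeros((3, 8))                  # maps 8 terrain/state features to a 3-d residual force
    true_W = rng.normal(size=(3, 8))      # unknown "true" residual model for this toy example
    for _ in range(2000):
        phi = rng.normal(size=8)                      # features from the frozen VFM + DNN backbone
        residual = true_W @ phi - W @ phi             # disturbance prediction error
        W = adapt_last_layer(W, phi, residual, tracking_error=np.zeros(3))
    print(np.round(np.linalg.norm(true_W - W), 3))    # adaptation error after online updates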
|
|
17:10-17:15, Paper WeET17.8 | |
DINO-MOT: 3D Multi-Object Tracking with Visual Foundation Model for Pedestrian Re-Identification Using Visual Memory Mechanism |
|
Lee, Min Young | National University of Singapore |
Lee, Christina Dao Wen | National University of Singapore |
Jianghao, Li | National University of Singapore |
Ang Jr, Marcelo H | National University of Singapore |
Keywords: Intelligent Transportation Systems, Human Detection and Tracking, Deep Learning for Visual Perception
Abstract: In the advancing domain of autonomous driving, this research focuses on enhancing 3D Multi-Object Tracking (3D-MOT). Pedestrians are particularly vulnerable in urban environments, and robust tracking methodologies are required to understand their movements. Prevalent Tracking-By-Detection (TBD) frameworks often underutilize the rich visual data from sensors such as cameras. This study leverages the advanced visual foundation model DINOv2 to refine the TBD framework by incorporating the camera modality, thereby improving pedestrian tracking consistency and overall 3D-MOT performance. The proposed DINO-MOT framework is the first application of DINOv2 for enhancing 3D-MOT through pedestrian Re-Identification (Re-ID), and a Score Filter Ceiling is implemented to prevent premature exclusion of low-confidence 3D detections during tracking association. Furthermore, utilization of DINOv2 as a feature extractor within the DINO-MOT framework reduces pedestrian ID switches by up to 12.3%. Achieving an AMOTA of 76.3% on the nuScenes test dataset, DINO-MOT has set a new benchmark in the 3D-MOT literature with an improvement of 0.5%, securing the top rank on the leaderboard. More broadly, this research paves the way for applying visual foundation models to improve the existing TBD framework and enhance 3D-MOT in autonomous driving.
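To illustrate the appearance-based re-identification step, the following sketch matches a new detection against per-track features held in a visual memory by cosine similarity and refreshes the memory with an exponential moving average; the feature dimension, threshold, and update rule are assumptions rather than the paper's exact design.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def reid_associate(detection_feat, memory, threshold=0.7):
        # Match a new pedestrian detection against per-track appearance features
        # kept in a visual memory, using cosine similarity of embeddings
        # (e.g., features from a frozen DINOv2 backbone).
        best_id, best_sim = None, threshold
        for track_id, feat in memory.items():
            sim = cosine(detection_feat, feat)
            if sim > best_sim:
                best_id, best_sim = track_id, sim
        return best_id

    rng = np.random.default_rng(0)
    memory = {7: rng.normal(size=384), 12: rng.normal(size=384)}   # track_id -> stored feature
    det = memory[12] + 0.1 * rng.normal(size=384)                  # new detection, close to track 12
    matched = reid_associate(det, memory)
    if matched is not None:
        memory[matched] = 0.9 * memory[matched] + 0.1 * det        # exponential-moving-average update
    print(matched)   # 12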
|
|
WeET18 |
406 |
Surgical Robotics: Steerable Catheters/Needles 2 |
Regular Session |
Chair: Hoelscher, Janine | Clemson |
Co-Chair: Krieger, Axel | Johns Hopkins University |
|
16:35-16:40, Paper WeET18.1 | |
Hysteresis Compensation of Tendon-Sheath Mechanism Using Nonlinear Programming Based on Preisach Model |
|
Kim, Hongmin | Massachusetts Institute of Technology |
Kim, Dongchan | KAIST (Korea Advanced Institute of Science and Technology) |
Park, Su Hyeon | Pusan National University |
Jin, Sangrok | Pusan National University |
Keywords: Tendon/Wire Mechanism, Medical Robots and Systems, Surgical Robotics: Laparoscopy
Abstract: The tendon-sheath mechanism (TSM) is an essential mechanical element for the implementation of flexible endoscopic systems owing to its small volume and simple structure. However, nonlinear characteristics such as backlash, hysteresis, and friction occur when employing such a component. In this study, we formulate a Preisach hysteresis model consisting of elementary hysteresis operators. Subsequently, we propose a compensation algorithm that repeatedly and sequentially solves a nonlinear optimization problem online, producing an inverse control signal for the desired output at every time step and compensating for the nonlinear effects of the TSM. The results indicate that the presented model and control scheme are promising for motion control in any application utilizing a TSM.
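A compact sketch of the ideas above, assuming uniform Preisach weights and a simple grid search in place of the paper's nonlinear program: the model sums relay operators with thresholds beta <= alpha, and the compensator searches for the input whose predicted output best matches the desired one.

    import numpy as np

    class Preisach:
        # Discrete Preisach hysteresis model: a weighted sum of relay operators
        # with thresholds beta <= alpha; weights here are uniform for illustration.
        def __init__(self, n=20, lo=-1.0, hi=1.0):
            th = np.linspace(lo, hi, n)
            self.pairs = [(a, b) for a in th for b in th if b <= a]
            self.weights = np.ones(len(self.pairs)) / len(self.pairs)
            self.states = -np.ones(len(self.pairs))      # each relay starts "off"

        def step(self, u, states=None):
            s = self.states if states is None else states
            for i, (a, b) in enumerate(self.pairs):
                if u >= a:
                    s[i] = 1.0
                elif u <= b:
                    s[i] = -1.0                          # otherwise the relay keeps its state
            return float(self.weights @ s)

    def compensate(model, y_des, candidates=np.linspace(-1, 1, 201)):
        # One step of inverse control: search for the input whose predicted output is
        # closest to the desired output, evaluated on copies of the relay states so
        # the search does not disturb the model's memory.
        errors = [abs(model.step(u, states=model.states.copy()) - y_des) for u in candidates]
        return float(candidates[int(np.argmin(errors))])

    tsm = Preisach()
    u = compensate(tsm, y_des=0.4)
    print(round(u, 3), round(tsm.step(u), 3))   # applied input and resulting model output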
|
|
16:40-16:45, Paper WeET18.2 | |
Resolution Optimal Motion Planning for Medical Needle Steering from Airway Walls in the Lung |
|
Hoelscher, Janine | Clemson |
Fried, Inbar | University of North Carolina at Chapel Hill |
Salzman, Oren | Technion |
Alterovitz, Ron | University of North Carolina at Chapel Hill |
Keywords: Surgical Robotics: Planning, Surgical Robotics: Steerable Catheters/Needles, Nonholonomic Motion Planning
Abstract: Steerable needles are novel medical devices capable of following curved paths through tissue, enabling them to avoid anatomical obstacles and steer to hard-to-reach sites in tissue, including targets in the lung for lung cancer diagnosis. Steerable needles are typically deployed into tissue from an insertion surface, and selecting the insertion site is critical for procedure success as it determines which paths the needle can take to its target. Prior motion planners for steerable needles typically only plan from a specific start pose to the target. We introduce a new resolution-optimal steerable needle motion planner that efficiently finds plans from an insertion surface to a target position, handling additional degrees of freedom at both the start and the target. Our algorithm systematically builds a search tree consisting of needle motion primitives backward from the target towards the insertion surface, which allows it to provide an optimality guarantee up to the resolution of the primitives. The algorithm finds higher-quality plans faster than prior state-of-the-art motion planners, as demonstrated in anatomical scenario simulations in the lung.
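The toy 2D sketch below conveys the flavor of building a search tree of motion primitives backward from the target until the insertion surface is reached (here the plane y = 0), using uniform-cost search; the primitive set, step length, and cost are illustrative and this is not the authors' resolution-optimal planner.

    import heapq
    import numpy as np

    # Each primitive moves one step length while turning by a bounded angle
    # (a stand-in for the needle's curvature limit).
    STEP, MAX_TURN = 1.0, np.deg2rad(15)
    PRIMITIVES = [-MAX_TURN, 0.0, MAX_TURN]

    def backward_plan(target_xy, target_heading, max_depth=30):
        # Best-first search over (x, y, heading) states, expanding backward from the
        # target until a state reaches the insertion surface; cost is path length.
        heap = [(0.0, (*target_xy, target_heading), [])]
        while heap:
            cost, (x, y, h), path = heapq.heappop(heap)
            if y <= 0.0:                       # reached the insertion surface
                return cost, path
            if len(path) >= max_depth:
                continue
            for turn in PRIMITIVES:
                h2 = h + turn
                x2, y2 = x - STEP * np.cos(h2), y - STEP * np.sin(h2)   # step backward
                heapq.heappush(heap, (cost + STEP, (x2, y2, h2), path + [turn]))
        return None

    cost, turns = backward_plan(target_xy=(3.0, 6.0), target_heading=np.deg2rad(80))
    print(round(cost, 1), len(turns))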
|
|
16:45-16:50, Paper WeET18.3 | |
Self-Sufficient 5-DoF Discrete Global Localization for Magnetically-Actuated Endoscope in Bronchoscopy |
|
Tan, Jiewen | The Chinese University of Hong Kong |
Zhao, Da | The Chinese University of Hong Kong |
Zhou, Rui | The Chinese University of Hong Kong |
Xie, Wenxuan | The Chinese University of Hong Kong |
Cheng, Shing Shin | The Chinese University of Hong Kong |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles
Abstract: Existing sensor-based global localization methods limit the miniaturization potential of magnetically-actuated endoscopes (MAE) while localization based on external medical imaging demands accurate registration and imposes a variety of modality-specific challenges during continuous image acquisition. This work proposes a novel self-sufficient method for discrete (one-time) global localization of an MAE based solely on inherent endoscopic images without any prior MAE pose information. More specifically, it adopts a model-free control approach to determine five different external magnet (EM) poses (corresponding to five independent nonlinear equations) that can align the MAE image center with the lumen center while the MAE maintains the same pose. The five degree-of-freedom (DoF) global pose of the MAE can then be estimated by minimizing the root mean square of the MAE's torque balance residuals under these EM poses. Our proposed method achieves similar accuracy as other sensor-based methods for permanent magnet-driven MAE with 6.7 ± 2.1 mm position error and 9.5 ± 2.9° orientation error in the experiments. Compared to existing methods, our approach does not require physical sensor integration, enabling a more compact endoscope design for exploration in narrower respiratory tracts. It also offers a critical step toward achieving sensorless and continuous global localization of the permanent magnet-driven MAE during its autonomous navigation.
|
|
16:50-16:55, Paper WeET18.4 | |
Intraoperative 3D Shape Estimation of Magnetic Soft Guidewire |
|
Zhao, Yiting | Beijing Institute of Technology |
Shi, Liwei | Beijing Institute of Technology |
Xiao, Nan | Beijing Institute of Technology |
Keywords: Surgical Robotics: Planning, Soft Robot Applications, Sensor Fusion
Abstract: This paper introduces a 3D shape reconstruction technique for interventional devices in endovascular surgery, using a flexible magnetic-tipped guidewire that retains the essential properties of a standard guidewire. We develop a model that relates the shape of the magnetic tip to the surrounding magnetic field distribution, so that the shape can be estimated from magnetic field measurements. The relationship between the magnetic field distribution and the shape of the magnetic guidewire makes direct shape estimation challenging. To address this, we incorporate image and physical constraints to simplify the estimation process. The method shows high accuracy and stability in shape estimation, with both root mean square error (RMSE) and Hausdorff distance (HD) below 1 mm, outperforming other existing estimation methods. Notably, the interventional guidewire requires no embedded sensors or wiring, and the fluoroscopic images used are standard in clinical practice.
|
|
16:55-17:00, Paper WeET18.5 | |
Semi-Autonomous 2.5D Control of Untethered Magnetic Suture Needle |
|
Wang, Qinhan | Johns Hopkins University |
Bhattacharjee, Anuruddha | The Johns Hopkins University |
Chen, Xinhao | Johns Hopkins University |
Mair, Lamar | Weinberg Medical Physics, Inc |
Diaz-Mercado, Yancy | University of Maryland |
Krieger, Axel | Johns Hopkins University |
Keywords: Medical Robots and Systems, Surgical Robotics: Steerable Catheters/Needles, Manipulation Planning
Abstract: Untethered miniature surgical tools could significantly reduce invasiveness and enhance patient outcomes in robot-assisted laparoscopic surgical procedures. This paper demonstrates the feasibility of performing semi-autonomous suturing tasks using an untethered magnetic needle controlled by an external electromagnetic manipulator. The electromagnetic manipulator can generate magnetic torques and gradient-based pulling forces to actuate the magnetic needle. Here, we develop and implement a semi-autonomous 2.5D control method for controlling the in-plane position and both in-plane and out-of-plane orientations of a magnetic needle for suturing on tissue-mimicking agar gel phantoms. The method includes recognizing needles and incisions, planning trajectory, and performing suturing with visual feedback control. We conduct two mock suturing tasks using both continuous and interrupted techniques on 1% agar gel phantoms with 2 cm and 3 cm incision sizes. The results demonstrate precise needle control, with an average root-mean-square position error of 1.01 mm and 1.12 mm across tasks. The system also achieved submillimeter-level suture spacing accuracy, comparable to surgeons using state-of-the-art surgical robots. These findings highlight the feasibility of using untethered magnetic suture needles for minimally invasive suturing procedures.
|
|
17:00-17:05, Paper WeET18.6 | |
Steerable Tape-Spring Needle for Autonomous Sharp Turns through Tissue |
|
Abdoun, Omar | University of Pennsylvania |
Tjandra, Davin | University of Pennsylvania |
Yin, Katie | University of California, Riverside |
Kurzan, Pablo | University of Pennsylvania |
Yin, Jessica | University of Pennsylvania |
Yim, Mark | University of Pennsylvania |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Surgical Robotics: Planning, Surgical Robotics: Laparoscopy
Abstract: Steerable needles offer a minimally invasive method to deliver treatment to hard-to-reach tissue regions. We introduce a new class of tape-spring steerable needles, capable of sharp turns ranging from 15 to 150 degrees with a turn radius as low as 3 mm, which minimize surrounding tissue damage. In this work, we derive and experimentally validate a geometric model for our steerable needle design. We evaluate both manual and robotic steering of the needle along a Dubins path in 7 kPa and 13 kPa tissue phantoms, simulating our target clinical application in healthy and unhealthy liver tissue. We conduct experiments to measure needle robustness to stiffness transitions between non-homogeneous tissues. We demonstrate progress towards clinical use with needle tip tracking via ultrasound imaging, navigation around anatomical obstacles, and integration with a robotic autonomous steering system.
|
|
17:05-17:10, Paper WeET18.7 | |
Shape Control of Concentric Tube Robots Via Approximate Follow-The-Leader Motion |
|
Xu, Yunti | University of California, San Diego |
Watson, Connor | Morimoto Lab, UCSD |
Lin, Jui-Te | University of California, San Diego |
Hwang, John T. | University of California, San Diego |
Morimoto, Tania K. | University of California San Diego |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Modeling, Control, and Learning for Soft Robots, Medical Robots and Systems
Abstract: Concentric tube robots (CTRs) are miniaturized continuum robots that are promising for robotic minimally invasive surgeries. Control methods to date have primarily focused on controlling the robot tip. However, small changes in the tip position can result in large deviations in the shape of the robot body, motivating the need for shape control to ensure safe navigation in constrained environments. One proposed method for shape control, known as follow-the-leader (FTL) motion, allows the robot to deploy while occupying minimal volume but is limited to specific CTR designs and deployment sequences. In this paper, we propose a shape control method that approximates FTL motion and is applicable to arbitrary tip navigation tasks without requiring a predefined trajectory or specific tube design. This shape control method is framed as a nonlinear optimization problem, and through linearization of the CTR's kinematics, we turn it into a quadratic program solved by a shape controller that requires minimal knowledge of the robot's shape. Simulation results show that our proposed shape control method enables better approximate FTL motion compared to a state-of-the-art Jacobian-based tip controller across different tube sets and tip paths while remaining computationally comparable. Furthermore, a hardware demonstration validates the effectiveness of the shape controller on a physical system during teleoperation.
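As a simplified stand-in for the quadratic program described above, the sketch below trades off tip tracking against keeping body points near the previously swept (follow-the-leader) curve, solved as a regularized least-squares problem over a linearized kinematic model; the Jacobians here are random placeholders rather than a real CTR model.

    import numpy as np

    def shape_control_step(J_tip, J_shape, dx_tip, ds_shape, w=0.1, reg=1e-6):
        # One linearized shape-control update: minimize a weighted sum of the tip
        # tracking error and the body-point deviation from the swept curve.
        A = J_tip.T @ J_tip + w * J_shape.T @ J_shape + reg * np.eye(J_tip.shape[1])
        b = J_tip.T @ dx_tip + w * J_shape.T @ ds_shape
        return np.linalg.solve(A, b)              # joint increment dq

    rng = np.random.default_rng(0)
    J_tip = rng.normal(size=(3, 6))      # tip position w.r.t. 6 tube joint values (placeholder)
    J_shape = rng.normal(size=(12, 6))   # 4 body points (stacked xyz) w.r.t. joint values (placeholder)
    dq = shape_control_step(J_tip, J_shape,
                            dx_tip=np.array([0.001, 0.0, 0.002]),
                            ds_shape=np.zeros(12))
    print(np.round(dq, 5))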
|
|
17:10-17:15, Paper WeET18.8 | |
Model-Based Parameter Selection for a Steerable Continuum Robot — Applications to Bronchoalveolar Lavage (BAL) |
|
Rothe, Amber K. | Georgia Institute of Technology |
Brumfiel, Timothy A. | Georgia Institute of Technology |
Konda, Revanth | Georgia Institute of Technology |
Williams, Kirsten | Emory University |
Desai, Jaydev P. | Georgia Institute of Technology |
Keywords: Surgical Robotics: Steerable Catheters/Needles, Tendon/Wire Mechanism, Medical Robots and Systems
Abstract: Bronchoalveolar lavage (BAL) is a minimally invasive procedure for diagnosing lung infections and diseases. However, navigating tortuous lung anatomy to the distal branches of the bronchoalveolar tree for adequate sampling using BAL remains challenging. Continuum robots have been used to improve the navigation of guidewires, catheters, and endoscopes and could be applied to the BAL procedure as well. One class of continuum robots is constructed from a notched tube and actuated using a tendon. Many tendon-driven notched continuum robots use uniform machining parameters to achieve approximately constant-curvature configurations, which may be unsuitable for traversing the tortuous anatomy of the lungs. This letter presents a model that predicts the curvature of a robot with arbitrary notch shapes subjected to tendon tension. The model predicted the deflection of rectangular, elliptical, and sinusoidal notches in a 0.89 mm diameter nitinol tube with 2.32%, 3.65%, and 6.32% error, respectively. Furthermore, an algorithm is developed to determine the optimal pattern of notches to achieve a desired nonuniform robot curvature. A simulated robot designed using the algorithm achieved the desired shape with a root mean square error (RMSE) of 1.52°. Additionally, we present a model for predicting the shape of nonuniformly notched continuum robots which incorporates friction and pre-curvature. This model predicted the shape of a continuum robot with nonuniform rectangular notches with an average RMSE of 5.20° with respect to the actual robot. We also demonstrated navigating the continuum robot through a pulmonary phantom.
|
|
WeET19 |
407 |
Logistics and Task Planning |
Regular Session |
Co-Chair: Arras, Kai Oliver | University of Stuttgart |
|
16:35-16:40, Paper WeET19.1 | |
A New Clustering-Based View Planning Method for Building Inspection with Drone |
|
Zheng, Yongshuai | Shandong University |
Liu, Guoliang | Shandong University |
Ding, Yan | Shandong University |
Tian, Guohui | Shandong University |
Keywords: Task Planning, Surveillance Robotic Systems, Computational Geometry
Abstract: With the rapid development of drone technology, the application of drones equipped with visual sensors for building inspection and surveillance has attracted much attention. View planning aims to find a set of near-optimal viewpoints for vision-related tasks to achieve the vision coverage goal. This paper proposes a new clustering-based two-step computational method using spectral clustering, local potential field method, and hyper-heuristic algorithm to find near-optimal views to cover the target building surface. In the first step, the proposed method generates candidate viewpoints based on spectral clustering and corrects the positions of candidate viewpoints based on our newly proposed local potential field method. In the second step, the optimization problem is converted into a Set Covering Problem (SCP), and the optimal viewpoint subset is solved using our proposed hyper-heuristic algorithm. Experimental results show that the proposed method is able to obtain better solutions with fewer viewpoints and higher coverage.
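The second step casts viewpoint selection as a Set Covering Problem; as a baseline illustration (the paper itself uses a hyper-heuristic solver), the classic greedy set-cover approximation is sketched below with hypothetical viewpoints and surface patches.

    def greedy_set_cover(universe, candidate_views):
        # Greedy approximation for the Set Covering Problem: repeatedly pick the
        # candidate viewpoint that covers the most still-uncovered surface patches.
        uncovered = set(universe)
        chosen = []
        while uncovered:
            best = max(candidate_views, key=lambda v: len(candidate_views[v] & uncovered))
            if not candidate_views[best] & uncovered:
                break                                   # remaining patches are not coverable
            chosen.append(best)
            uncovered -= candidate_views[best]
        return chosen, uncovered

    patches = range(8)                                  # discretized building-surface patches
    views = {                                           # viewpoint -> patches visible from it
        "v0": {0, 1, 2}, "v1": {2, 3, 4}, "v2": {4, 5, 6, 7}, "v3": {0, 3, 6},
    }
    selected, missed = greedy_set_cover(patches, views)
    print(selected, missed)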
|
|
16:40-16:45, Paper WeET19.2 | |
Towards the Deployment of an Autonomous Last-Mile Delivery Robot in Urban Areas |
|
Santamaria-Navarro, Angel | Universitat Politècnica De Catalunya |
Hernandez Juan, Sergi | CSIC-UPC (IRI) |
Herrero Cotarelo, Fernando | IRI, CSIC-UPC |
López Gestoso, Alejandro | Institut De Robòtica I Informàtica Industrial |
Del Pino, Ivan | Instituto Universitario De Investigación Informática (IUII). Uni |
Rodriguez Linares, Nicolás Adrián | Universidad Politécnica De Cataluña |
Fernandez, Carlos | Urbiotica |
Baldó i Canut, Albert | CARNET Future Mobility Research Hub |
Lemardelé, Clément | Universitat Politècnica De Catalunya |
Garrell, Anais | UPC-CSIC |
Vallvé, Joan | CSIC-UPC |
Taher, Hafsa | Institut De Robòtica I Informàtica Industrial, CSIC-UPC |
Puig-Pey, Ana | Universitat Politecnica De Catalunya |
Pagès, Laia | CARNET |
Sanfeliu, Alberto | Universitat Politècnica De Cataluyna |
Keywords: Intelligent Transportation Systems, Logistics, Field Robots
Abstract: Nowadays, the skyrocketing last-mile freight transportation in urban areas is leading to very negative effects (e.g., pollution, noise or traffic congestion), which could be minimized by using autonomous electric vehicles. In this sense, this paper presents the first prototype of Ona, an autonomous last-mile delivery robot that, in contrast to existing platforms, has a medium-sized storage capacity with the capability of navigating in both street and pedestrian areas. Here, we describe the platform, its main software modules, and the validation experiments, carried out in the Barcelona Robot Lab (Universitat Politècnica de Catalunya); Esplugues de Llobregat (next to Barcelona); and Debrecen (Hungary), which are representative urban scenarios. Apart from robotic technical details, we also include the results of the technology acceptance by the public present in the Esplugues de Llobregat test, collected in situ through a survey.
|
|
16:45-16:50, Paper WeET19.3 | |
Multi-Heuristic Robotic Bin Packing of Regular and Irregular Objects |
|
Nickel, Tim | Fraunhofer IPA |
Bormann, Richard | Fraunhofer IPA |
Arras, Kai Oliver | University of Stuttgart |
Keywords: Logistics, Manipulation Planning, Factory Automation
Abstract: The increasing demand in e-commerce, combined with labor shortages and rising wages, is driving the rapid automation of warehouse operations. A critical aspect of this shift is bin packing, where diverse unknown items of varying sizes and shapes must be optimally arranged within a bin or container. Robot bin packing is receiving growing attention and presents unique challenges due to the broad range of objects, packing rules, and task-specific requirements. In response, we propose So-Pack, a generalist packing heuristic for irregularly shaped objects integrated into a flexible, weighted multi-heuristic planning system. The system demonstrates robust performance across general packing scenarios and exhibits the flexibility to adapt to changing packing rules and specific end-user requirements. Experimental results show that the system outperforms state-of-the-art approaches in key metrics in a new challenging dataset of retail objects in real-world applications.
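A minimal sketch of a weighted multi-heuristic scoring scheme in the spirit of the planning system described above; the individual heuristics, the placement fields, and the weights are hypothetical and only illustrate how several packing heuristics can be combined into one placement decision.

    def score_placement(placement, heuristics, weights):
        # Weighted multi-heuristic scoring: each heuristic rates a candidate placement,
        # and the weighted sum decides which placement the planner executes.
        return sum(w * h(placement) for h, w in zip(heuristics, weights))

    # Illustrative heuristics over a placement dict (names and fields are hypothetical).
    height_low = lambda p: -p["z"]                       # prefer placing items low in the bin
    wall_contact = lambda p: p["contact_faces"]          # prefer placements touching walls/items
    wasted_space = lambda p: -p["trapped_volume"]        # penalize unreachable gaps underneath

    candidates = [
        {"z": 0.00, "contact_faces": 2, "trapped_volume": 0.0},
        {"z": 0.12, "contact_faces": 3, "trapped_volume": 0.4},
    ]
    weights = [1.0, 0.5, 2.0]
    best = max(candidates,
               key=lambda p: score_placement(p, [height_low, wall_contact, wasted_space], weights))
    print(best)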
|
|
16:50-16:55, Paper WeET19.4 | |
MultiTalk: Introspective and Extrospective Dialogue for Human-Environment-LLM Alignment |
|
Devarakonda, Venkata Naren | New York University |
Kaypak, Ali Umut | New York University |
Yuan, Shuaihang | New York University |
Krishnamurthy, Prashanth | New York University Tandon School of Engineering |
Fang, Yi | New York University |
Khorrami, Farshad | New York University Tandon School of Engineering |
Keywords: Task Planning, AI-Enabled Robotics, Manipulation Planning
Abstract: LLMs have shown promising results in task planning due to their strong natural language understanding and reasoning capabilities. However, issues such as hallucinations, ambiguities in human instructions, environmental constraints, and limitations in the executing agent’s capabilities often lead to flawed or incomplete plans. This paper proposes MultiTalk, an LLM-based task planning methodology that addresses these issues through a framework of introspective and extrospective dialogue loops. This approach helps ground generated plans in the context of the environment and the agent's capabilities, while also resolving uncertainties and ambiguities in the given task. These loops are enabled by specialized systems designed to extract and predict task-specific states, and flag mismatches or misalignments among the human user, the LLM agent, and the environment. Effective feedback pathways between these systems and the LLM planner foster meaningful dialogue. The efficacy of this methodology is demonstrated through its application to robotic manipulation tasks. Experiments and ablations highlight the robustness and reliability of our method, and comparisons with baselines further illustrate the superiority of MultiTalk in task planning for embodied agents. Project Website: https://llm-multitalk.github.io/
|
|
16:55-17:00, Paper WeET19.5 | |
Goal-Guided Reinforcement Learning: Leveraging Large Language Models for Long-Horizon Task Decomposition |
|
Zhang, Ceng | National University of Singapore |
Sun, Zhanhong | National University of Singapore |
Chirikjian, Gregory | University of Delaware |
Keywords: Task Planning, Reinforcement Learning
Abstract: Reinforcement learning (RL) has long struggled with exploration in vast state-action spaces, particularly for intricate tasks that necessitate a series of well-coordinated actions. Meanwhile, large language models (LLMs) equipped with fundamental knowledge have been utilized for task planning across various domains. However, using them to plan for long-term objectives can be demanding, as they function independently from task environments where their knowledge might not be perfectly aligned, hence often overlooking possible physical limitations. To this end, we propose a goal-based RL framework that leverages prior knowledge of LLMs to benefit the training process. We introduce a hierarchical module that features a goal generator to segment a long-horizon task into reachable subgoals and a policy planner to generate action sequences based on the current goal. Subsequently, the policies derived from LLMs guide the RL to achieve each subgoal sequentially. We validate the effectiveness of the proposed framework across different simulation environments and long-horizon tasks with complex state and action spaces.
|
|
17:00-17:05, Paper WeET19.6 | |
Trustworthy Robot Behavior Tree Generation Based on Multi-Source Heterogeneous Knowledge Graph |
|
Yuan, Jianchao | National University of Defense Technology |
Yang, Shuo | National University of Defense Technology |
Zhang, Qi | National University of Defense Technology |
Li, Ge | National University of Defense Technology |
Tang, Jianping | National University of Defense Technology |
Keywords: Task Planning, Software Architecture for Robotic and Automation
Abstract: In robotics, the design of robot behavior trees (BTs) generally requires roboticists to comprehensively and flexibly consider all the relevant factors, including the robot hardware capabilities, task descriptions, etc., posing great challenges for design quality and efficiency. The mainstream BT design practice has been to manually develop task-specific BT structures using a BT component framework. In contrast, the latest advances in Generative Pretrained Transformers (GPTs) have also opened up the possibility of BT design automation. However, these approaches generally show low efficiency or are less trustworthy for complex robot task goals due to time-consuming manual design and unreliable GPT reasoning. To address these limitations, this paper proposes a novel knowledge-driven approach that develops a specialized knowledge graph from multi-source, heterogeneous, high-quality robot knowledge to reason out a trustworthy robot plan for achieving complex task goals. Then we present plan transformation and BT merging algorithms to automatically generate the plan-level BT structure. The comparative experiment results show that our approach can generate high-quality and trustworthy BT structures in terms of task plan accuracy and consistency, as well as BT generation time, compared with manual design and GPT-based approaches.
|
|
17:05-17:10, Paper WeET19.7 | |
Physics-Aware Robotic Palletization with Online Masking Inference |
|
Zhang, Tianqi | Tsinghua University |
Wu, Zheng | University of California, Berkeley |
Chen, Yuxin | University of California, Berkeley |
Wang, Yixiao | University of California, Berkeley |
Liang, Boyuan | University of California, Berkeley |
Moura, Scott | UC Berkeley |
Tomizuka, Masayoshi | University of California |
Ding, Mingyu | UC Berkeley |
Zhan, Wei | University of California, Berkeley |
Keywords: Task Planning, Reinforcement Learning
Abstract: The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic and physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy towards valid actions. Unlike previous methods that rely on heuristic stability assessments that are difficult to verify in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-art methods. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings.
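The online action-space masking described above can be pictured as suppressing placements that a learned feasibility model rejects before the policy samples an action. The sketch below is illustrative only; the mask network, threshold, and tensor shapes are assumptions, not the authors' implementation.

import torch

def masked_placement_logits(policy_logits, mask_net, observation, threshold=0.5):
    # policy_logits: (batch, num_placements) raw scores over discrete placements.
    # mask_net: learned model returning per-placement feasibility scores.
    feasibility = torch.sigmoid(mask_net(observation))        # (batch, num_placements)
    invalid = feasibility < threshold                          # placements judged unstable
    return policy_logits.masked_fill(invalid, float("-inf"))  # these are never sampled

# Example rollout usage (probabilities renormalize over the valid placements):
# probs = torch.softmax(masked_placement_logits(logits, mask_net, obs), dim=-1)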
|
|
17:10-17:15, Paper WeET19.8 | |
Enabling In-Flight Metamorphosis in Multirotors with a Center-Driven Scissor Extendable Airframe for Adaptive Navigation |
|
Yang, Tao | Harbin Institute of Technology, Shenzhen |
Li, Peng | Harbin Institute of Technology, Shenzhen |
Wang, Gang | University of Shanghai for Science and Technology |
Shen, Yantao | University of Nevada, Reno |
Keywords: Foundations of Automation, Autonomous Vehicle Navigation
Abstract: To address complex mission tasks, multirotors benefit from in-flight reconfiguration that enhances their morphological adaptability. This paper presents the Center-Driven Scissor Extendable Airframe (CDSEA), a novel one-degree-of-freedom (DOF) morphing airframe designed to replace traditional fixed-size airframes. The CDSEA allows a quadrotor to achieve significant morphological changes during flight, with rotors deploying radially from a central point. This capability facilitates substantial variations in footprint radius and ensures smooth transitions. The paper details the mechanical design, as well as kinematic and dynamic analyses, and discusses the actuator selection strategy for the CDSEA. Experimental results with a prototype demonstrate that the CDSEA achieves a footprint-radius deformation ratio of 2.5 and a morphing time of 0.3 seconds, surpassing existing solutions. Additionally, the design improves obstacle avoidance and wind resistance. These results underscore the CDSEA's potential as an advanced solution for enhancing UAV adaptive navigation performance in complex environments.
|
|
WeET20 |
408 |
Planning Around People for Social Navigation |
Regular Session |
Chair: Mendez, Oscar | University of Surrey |
Co-Chair: Mavrogiannis, Christoforos | University of Michigan |
|
16:35-16:40, Paper WeET20.1 | |
SafePCA: Enhancing Autonomous Robot Navigation in Dynamic Crowds Using Proximal Policy Optimization and Cellular Automata |
|
Farouq, Ardiansyah | Telkom University |
Tran, Dinh Tuan | College of Information Science and Engineering, Ritsumeikan Univ |
Lee, Joo-Ho | Ritsumeikan University |
Keywords: Motion and Path Planning, Machine Learning for Robot Control, Localization
Abstract: Navigating robots in dynamic environments, such as human crowds, is a major challenge due to the trade-off between performance and robustness. Traditional reinforcement learning methods, such as Proximal Policy Optimization (PPO), have shown strong adaptation capabilities but require extensive training and lack explicit mechanisms for collision avoidance. On the other hand, rule-based approaches, such as the Dynamic Window Approach (DWA), offer computational efficiency but struggle with generalization to unseen crowd behaviors. The proposed SafePCA framework aims to address this trade-off by integrating Cellular Automata (CA) into PPO-based navigation. CA enhances robustness by predicting high-risk areas based on pedestrian movement patterns, reducing unnecessary collisions. However, this approach may lead to conservative behavior, potentially affecting navigation performance in reaching the goal efficiently. The core research question addressed in this work is whether SafePCA can balance these trade-offs to ensure safe yet efficient robot navigation in dynamic crowds. Experiments demonstrate that SafePCA outperforms traditional PPO by providing superior risk assessment and avoidance strategies, achieving optimal performance with fewer training episodes. SafePCA’s real-time adaptability ensures robust navigation in dynamic environments. By leveraging PPO’s adaptive learning and CA’s risk analysis, SafePCA offers an efficient solution for autonomous robot navigation in crowded environments, advancing the field and broadening application possibilities.
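A toy version of the cellular-automaton risk map described above: pedestrian-occupied cells repeatedly spread a decayed risk value to their four neighbors, producing the high-risk regions the policy should avoid. Grid resolution, number of steps, and the decay factor are illustrative assumptions, not the SafePCA implementation.

import numpy as np

def propagate_risk(occupancy, steps=3, decay=0.5):
    # occupancy: 2D grid with 1.0 where a pedestrian is observed, 0.0 elsewhere.
    risk = occupancy.astype(float)
    for _ in range(steps):
        padded = np.pad(risk, 1)
        neighbor_max = np.maximum.reduce([
            padded[:-2, 1:-1], padded[2:, 1:-1],   # up, down
            padded[1:-1, :-2], padded[1:-1, 2:],   # left, right
        ])
        risk = np.maximum(risk, decay * neighbor_max)
    return risk  # cells with high values are flagged as high-risk for the planner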
|
|
16:40-16:45, Paper WeET20.2 | |
Robot Local Planner: A Periodic Sampling-Based Motion Planner with Minimal Waypoints for Home Environments |
|
Takeshita, Keisuke | Toyota Motor Corporation |
Yamazaki, Takahiro | Toyota Motor Corporation |
Ono, Tomohiro | Toyota Motor Corporation |
Yamamoto, Takashi | Aichi Institute of Technology |
Keywords: Motion and Path Planning, Mobile Manipulation, Manipulation Planning
Abstract: The objective of this study is to enable fast and safe manipulation tasks in home environments. Specifically, we aim to develop a system that can recognize its surroundings and identify target objects while in motion, enabling it to plan and execute actions accordingly. We propose a periodic sampling-based whole-body trajectory planning method, called the “Robot Local Planner (RLP).” This method leverages unique features of home environments to enhance computational efficiency, motion optimality, and robustness against recognition and control errors, all while ensuring safety. The RLP minimizes computation time by planning with minimal waypoints and generating safe trajectories. Furthermore, overall motion optimality is improved by periodically executing trajectory planning to select more optimal motions. This approach incorporates inverse kinematics that are robust to base position errors, further enhancing robustness. Evaluation experiments demonstrated that the RLP outperformed existing methods in terms of motion planning time, motion duration, and robustness, confirming its effectiveness in home environments. Moreover, application experiments using a tidy-up task achieved high success rates and short operation times, thereby underscoring its practical feasibility.
|
|
16:45-16:50, Paper WeET20.3 | |
Diff-Refiner: Enhancing Multi-Agent Trajectory Prediction with a Plug-And-Play Diffusion Refiner |
|
Zhou, Xiangzheng | Nanjing University of Science and Technology |
Chen, Xiaobo | Shandong Technology and Business University |
Yang, Jian | Nanjing University of Science & Technology |
Keywords: Motion and Path Planning, Planning under Uncertainty
Abstract: The inherent stochasticity of agents’ behavior presents a challenge to trajectory prediction models, which are required to generate multiple plausible future trajectories. Recently, diffusion models have been applied to multimodal trajectory prediction. Existing approaches typically employ a standard diffusion process, denoising from a sample drawn from a Gaussian distribution. However, we identify that most agents exhibit an obvious movement trend, rendering many initial denoising steps redundant—they primarily transition from pure noise to an initial coarse trajectory. To address this challenge, this paper proposes a diffusion refiner that can be used along with existing multi-agent trajectory prediction models to improve their performance. Specifically, we first leverage a baseline model to predict a coarse future trajectory. Then, the diffusion model is applied as a refiner to reduce the prediction error. Moreover, our method is naturally plug-and-play, allowing convenient integration with existing models. To achieve this, we modify the traditional diffusion process so that it converges not only towards noise but also towards the coarse predictions from the baseline model. In this case, standard step-skipping sampling techniques are inapplicable, and we further propose an ordinary differential equation (ODE)-based fast sampling method. Extensive experiments with selected baseline models demonstrate the effectiveness of our approach.
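One way to picture the modified forward process mentioned above is a schedule that drifts the clean trajectory toward the baseline's coarse prediction while adding noise, so sampling can start near the coarse output instead of pure noise. The interpolation below is a hedged sketch under that reading, not the paper's exact formulation.

import torch

def forward_diffuse_toward_coarse(x0, coarse, t, T, noise_scale=1.0):
    # x0:     ground-truth future trajectory tensor.
    # coarse: baseline model's coarse prediction of the same trajectory.
    gamma = t / T                                   # 0 keeps x0, 1 reaches the coarse prediction
    mean = (1.0 - gamma) * x0 + gamma * coarse
    sigma = noise_scale * gamma
    return mean + sigma * torch.randn_like(x0)

# A refiner trained to invert this process only needs the last few (low-noise)
# steps at inference time, starting from the coarse prediction.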
|
|
16:50-16:55, Paper WeET20.4 | |
Scene-Aware Explainable Multimodal Trajectory Prediction |
|
Liu, Pei | The Hong Kong University of Science and Technology (Guangzhou) |
Liu, Haipeng | Shanghai Li Auto Co., Ltd |
Liu, Xingyu | Shenyang Agricultural University |
Li, Yiqun | Southeast University |
Chen, Junlan | Monash University |
He, Yangfan | University of Minnesota - Twin Cities |
Ma, Jun | The Hong Kong University of Science and Technology |
Keywords: Motion and Path Planning, Computer Vision for Transportation, Robust/Adaptive Control
Abstract: Advancements in intelligent technologies have significantly improved navigation in complex traffic environments by enhancing environment perception and trajectory prediction for automated vehicles. However, current research often overlooks the joint reasoning of scenario agents and lacks explainability in trajectory prediction models, limiting their practical use in real-world situations. To address this, we introduce the Explainable Conditional Diffusion-based Multimodal Trajectory Prediction (DMTP) model, which is designed to elucidate the environmental factors influencing predictions and reveal the underlying mechanisms. Our model integrates a modified conditional diffusion approach to capture multimodal trajectory patterns and employs a revised Shapley Value model to assess the significance of global and scenario-specific features. Experiments using the Waymo Open Motion Dataset demonstrate that our explainable model excels in identifying critical inputs and significantly outperforms baseline models in accuracy. Moreover, the factors identified align with the human driving experience, underscoring the model’s effectiveness in learning accurate predictions. Code is available in our open-source repository: https://github.com/ocean-luna/Explainable-Prediction.
|
|
16:55-17:00, Paper WeET20.5 | |
Safety-Critical Traffic Simulation with Adversarial Transfer of Driving Intentions |
|
Huang, Zherui | Shanghai Jiao Tong University |
Gao, Xing | Shanghai AI Lab |
Zheng, Guanjie | Shanghai Jiaotong University |
Wen, Licheng | Shanghai AI Laboratory |
Yang, Xuemeng | Shanghai Artificial Intelligence Laboratory |
Sun, Xiao | Shanghai AI Laboratory, China |
Keywords: Collision Avoidance, Intelligent Transportation Systems, Deep Learning Methods
Abstract: Traffic simulation, complementing real-world data with a long-tail distribution, allows for effective evaluation and enhancement of the ability of autonomous vehicles to handle accident-prone scenarios. However, simulating such safety-critical scenarios from log data, which mostly contain regular scenarios, is nontrivial, especially when considering the dynamic adversarial interactions between the future motions of autonomous vehicles and surrounding traffic participants. To address this, this paper proposes an innovative and efficient strategy, termed IntSim, that explicitly decouples the driving intentions of surrounding actors from their motion planning for realistic and efficient safety-critical simulation. We formulate the adversarial transfer of driving intention as an optimization problem, facilitating extensive exploration of diverse attack behaviors and efficient solution convergence. Simultaneously, intention-conditioned motion planning benefits from powerful deep models and large-scale real-world data, permitting the simulation of realistic motion behaviors for actors. Specifically, by adapting driving intentions to the environment, IntSim facilitates the flexible realization of dynamic adversarial interactions with autonomous vehicles. Finally, extensive open-loop and closed-loop experiments on real-world datasets, including nuScenes and Waymo, demonstrate that the proposed IntSim achieves state-of-the-art performance in simulating realistic safety-critical scenarios and further improves planners in handling such scenarios.
|
|
17:00-17:05, Paper WeET20.6 | |
The Radiance of Neural Fields: Democratizing Photorealistic and Dynamic Robotic Simulation |
|
Alcolado Nuthall, Georgina E | University of Surrey |
Bowden, Richard | University of Surrey |
Mendez, Oscar | University of Surrey |
Keywords: Simulation and Animation, Human-Centered Robotics, Software Tools for Robot Programming
Abstract: As robots increasingly coexist with humans, they must navigate complex, dynamic environments rich in visual information and implicit social dynamics, like when to yield or move through crowds. Addressing these challenges requires significant advances in vision-based sensing and a deeper understanding of socio-dynamic factors, particularly in tasks like navigation. To facilitate this, robotics researchers need advanced simulation platforms offering dynamic, photorealistic environments with realistic actors. Unfortunately, most existing simulators fall short, prioritizing geometric accuracy over visual fidelity, and employing unrealistic agents with fixed trajectories and low-quality visuals. To overcome these limitations, we developed a simulator that incorporates three essential elements: (1) photorealistic neural rendering of environments, (2) neurally animated human entities with behavior management, and (3) an ego-centric robotic agent providing multi-sensor output. By utilizing advanced neural rendering techniques in a dual-NeRF simulator, our system produces high-fidelity, photorealistic renderings of both environments and human entities. Additionally, it integrates a state-of-the-art Social Force Model to model dynamic human-human and human-robot interactions, creating the first photorealistic and accessible human-robot simulation system powered by neural rendering.
|
|
17:05-17:10, Paper WeET20.7 | |
Human-Robot Cooperative Distribution Coupling for Hamiltonian-Constrained Social Navigation |
|
Wang, Weizheng | Purdue University |
Yu, Chao | Tsinghua University |
Wang, Yu | Tsinghua University |
Min, Byung-Cheol | Purdue University |
Keywords: Motion and Path Planning, Acceptability and Trust, Deep Learning Methods
Abstract: Navigating in human-filled public spaces is a critical challenge for deploying autonomous robots in real-world environments. This paper introduces NaviDIFF, a novel Hamiltonian-constrained socially-aware navigation framework designed to address the complexities of human-robot interaction and socially-aware path planning. NaviDIFF integrates a port-Hamiltonian framework to model dynamic physical interactions and a diffusion model to manage uncertainty in human-robot cooperation. The framework leverages a spatial-temporal transformer to capture social and temporal dependencies, enabling more accurate spatial-temporal environmental dynamics understanding and port-Hamiltonian physical interactive process construction. Additionally, reinforcement learning from human feedback is employed to fine-tune robot policies, ensuring adaptation to human preferences and social norms. Extensive experiments demonstrate that NaviDIFF outperforms state-of-the-art methods in social navigation tasks, offering improved stability, efficiency, and adaptability.
|
|
17:10-17:15, Paper WeET20.8 | |
Crowd Perception Communication-Based Multi-Agent Path Finding with Imitation Learning |
|
Xie, Jing | National Innovation Institute of Defense Technology |
Zhang, Yongjun | National Innovation Institute of Defense Technology |
Yang, Huanhuan | National University of Defense Technology |
Ouyang, Qianying | Intelligent Game and Decision Lab; Tianjin Artificial Intelligence |
Dong, Fang | College of Computer, National University of Defense Technology |
Guo, Xinyu | Beijing Institute of Technology |
Jin, Songchang | Defense Innovation Institute |
Shi, Dianxi | Defense Innovation Institute |
Keywords: Path Planning for Multiple Mobile Robots or Agents, Reinforcement Learning
Abstract: Deep reinforcement learning-based Multi-Agent Path Finding (MAPF) has gained significant attention due to its remarkable adaptability to environments. Existing methods primarily leverage multi-agent communication in a fully-decentralized framework to maintain scalability while enhancing information exchange among agents. However, as the number of agents and obstacles increases, the environment becomes more complex, making cooperation between agents more difficult and leading to frequent crowding. To address these issues, we propose a decentralized planner, C3PIL, which integrates a Controlled Communication mechanism for Crowd Perception and uses Imitation Learning to improve policy learning. C3PIL first introduces a crowd perception communication module that perceives environmental crowd information and incorporates it into the controlled communication, effectively preventing and mitigating crowded situations. Furthermore, we employ generative adversarial imitation learning to learn a reward function from expert experience. This reduces the potential misguidance of a fixed reward function, improves the flexibility and diversity of agent behaviors, and ultimately enables agents to cooperate effectively. Finally, experimental results show that C3PIL not only outperforms previous learning-based MAPF methods, but also further enhances the cooperation of agents and significantly reduces crowding in complex environments. The code is available at https://github.com/JJingXie/C3PIL.
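The imitation reward learned from expert experience, as described above, is commonly obtained from a discriminator that separates expert transitions from agent transitions; the snippet below shows that generic pattern only (the discriminator architecture and any reward shaping used in C3PIL are not reproduced here).

import torch

def imitation_reward(discriminator, state, action, eps=1e-8):
    # discriminator(state, action) -> probability that the pair came from expert data.
    d = torch.clamp(discriminator(state, action), eps, 1.0 - eps)
    return -torch.log(1.0 - d)   # larger when the agent's behavior looks expert-like

# This learned reward replaces a fixed hand-crafted reward during policy updates.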
|
|
WeET21 |
410 |
Integrating Motion Planning and Learning 2 |
Regular Session |
Co-Chair: Soh, Harold | National University of Singapore |
|
16:35-16:40, Paper WeET21.1 | |
TSPDiffuser: Diffusion Models As Learned Samplers for Traveling Salesperson Path Planning Problems |
|
Yonetani, Ryo | CyberAgent |
Keywords: Integrated Planning and Learning, Motion and Path Planning, Autonomous Vehicle Navigation
Abstract: This paper presents TSPDiffuser, a novel data-driven path planner for traveling salesperson path planning problems (TSPPPs) in environments rich with obstacles. Given a set of destinations within obstacle maps, our objective is to efficiently find the shortest possible collision-free path that visits all the destinations. In TSPDiffuser, we train a diffusion model on a large collection of TSPPP instances and their respective solutions to generate plausible paths for unseen problem instances. The model can then be employed as a learned sampler to construct a roadmap that contains potential solutions with a small number of nodes and edges. This approach enables efficient and accurate estimation of travel costs between destinations, effectively addressing the primary computational challenge in solving TSPPPs. Experimental evaluations with diverse synthetic and real-world indoor/outdoor environments demonstrate the effectiveness of TSPDiffuser over existing methods in terms of the trade-off between solution quality and computational time requirements.
|
|
16:40-16:45, Paper WeET21.2 | |
Anticipatory Planning for Performant Long-Lived Robot in Large-Scale Home-Like Environments |
|
Talukder, Md Ridwan Hossain | George Mason University |
Arnob, Raihan Islam | George Mason University |
Stein, Gregory | George Mason University |
Keywords: Integrated Planning and Learning, Task Planning
Abstract: We consider the setting where a robot must complete a sequence of tasks in a persistent large-scale environment, given one at a time. Existing task planners often operate myopically, focusing solely on immediate goals without considering the impact of current actions on future tasks. Anticipatory planning, which minimizes the joint objective of the immediate planning cost of the current task and the expected cost associated with future subsequent tasks, offers an approach for improving long-lived task planning. However, applying anticipatory planning in large-scale environments presents significant challenges due to the sheer number of assets involved, which strains the scalability of learning and planning. In this research, we introduce a model-based anticipatory task planning framework designed to scale to large-scale realistic environments. Our framework uses a graph neural network (GNN), via a representation inspired by a 3D scene graph, to learn the essential properties of the environment crucial to estimating a state's expected cost, together with a sampling-based procedure for practical large-scale anticipatory planning. Our experimental results show that our planner reduces task-sequence cost by 5.38% in home settings and 31.5% in restaurant settings. When given time to prepare in advance, using our model reduces task-sequence costs by 40.6% and 42.5%, respectively.
|
|
16:45-16:50, Paper WeET21.3 | |
Scaling Diffusion Policy in Transformer to 1 Billion Parameters for Robotics Manipulation |
|
Zhu, MinJie | East China Normal University |
Zhu, Yichen | Midea Group |
Li, Jinming | Shanghai University |
Wen, Junjie | East China Normal University |
Xu, Zhiyuan | Midea Group |
Liu, Ning | Midea Group |
Cheng, Ran | Midea Robozone |
Shen, Chaomin | East China Normal University |
Peng, Yaxin | Shanghai University |
Feng, Feifei | Midea Group |
Tang, Jian | Midea Group (Shanghai) Co., Ltd |
Keywords: Imitation Learning, Deep Learning in Grasping and Manipulation
Abstract: Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. It is expected that Diffusion Policy possesses scalability, a key attribute for deep neural networks, typically suggesting that increasing model size would lead to enhanced performance. However, our observations indicate that Diffusion Policy in transformer architecture (DP) struggles to scale effectively; even minor additions of layers can deteriorate training outcomes. To address this issue, we introduce the Scalable Diffusion Transformer Policy (ScaleDP) for visuomotor learning. Our proposed method introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that DP suffers from large-gradient issues, making the optimization of Diffusion Policy unstable. To resolve this, we factorize the feature embedding of the observation into multiple affine layers and integrate it into the transformer blocks. Additionally, our unmasking strategy allows the policy network to "see" future actions during prediction, helping to reduce compounding errors. We demonstrate that our proposed method successfully scales Diffusion Policy from 10 million to 1 billion parameters, with improved performance and generalization. We benchmark ScaleDP across 50 different tasks from MetaWorld and find that our largest ScaleDP outperforms DP with an average improvement of 21.6%. Across 7 real-world robot tasks, ScaleDP demonstrates an average improvement of 22.5% over DP-T on four single-arm tasks and 66.7% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at https://scaling-diffusion-policy.github.io/.
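The factorized affine conditioning described above resembles adaptive-normalization style modulation, where the observation embedding produces per-block scale and shift parameters; the block below is a hedged sketch of that general pattern and is not the released ScaleDP code (layer names and dimensions are assumptions).

import torch
import torch.nn as nn

class AffineConditionedBlock(nn.Module):
    """Transformer-style block whose normalization is modulated by the
    observation embedding through a small affine (scale/shift) layer."""

    def __init__(self, dim, cond_dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, obs_embedding):
        # x: (batch, tokens, dim); obs_embedding: (batch, cond_dim)
        scale, shift = self.to_scale_shift(obs_embedding).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out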
|
|
16:50-16:55, Paper WeET21.4 | |
Implicit Contact Diffuser: Sequential Contact Reasoning with Latent Point Cloud Diffusion |
|
Huang, Zixuan | University of Michigan |
He, Yinong | University of Michigan |
Lin, Yating | University of Michigan |
Berenson, Dmitry | University of Michigan |
Keywords: Deep Learning in Grasping and Manipulation, Integrated Planning and Learning
Abstract: Long-horizon contact-rich manipulation has long been a challenging problem, as it requires reasoning over both discrete contact modes and continuous object motion. We introduce Implicit Contact Diffuser (ICD), a diffusion-based model that generates a sequence of neural descriptors that specify a series of contact relationships between the object and the environment. This sequence is then used as guidance for an MPC method to accomplish a given task. The key advantage of this approach is that the latent descriptors provide more task-relevant guidance to MPC, helping to avoid local minima for contact-rich manipulation tasks. Our experiments demonstrate that ICD outperforms baselines on complex, long-horizon, contact-rich manipulation tasks, such as cable routing and notebook folding. Additionally, our experiments also indicate that ICD can generalize a target contact relationship to a different environment.
|
|
16:55-17:00, Paper WeET21.5 | |
Diffusion Meets Options: Hierarchical Generative Skill Composition for Temporally-Extended Tasks |
|
Feng, Zeyu | National University of Singapore |
Luan, Hao | National University of Singapore |
Ma, Kevin Yuchen | National University of Singapore |
Soh, Harold | National University of Singapore |
Keywords: Reinforcement Learning, Learning from Demonstration, Hybrid Logical/Dynamical Planning and Verification
Abstract: Safe and successful deployment of robots requires not only the ability to generate complex plans but also the capacity to frequently replan and correct execution errors. This paper addresses the challenge of long-horizon trajectory planning under temporally extended objectives in a receding horizon manner. To this end, we propose DOPPLER, a data-driven hierarchical framework that generates and updates plans based on instructions specified in linear temporal logic (LTL). Our method decomposes temporal tasks into a chain of options with hierarchical reinforcement learning from offline non-expert datasets. It leverages diffusion models to generate options with low-level actions. We devise a determinantal-guided posterior sampling technique during batch generation, which improves the speed and diversity of diffusion-generated options, leading to more efficient querying. Experiments on robot navigation and manipulation tasks demonstrate that DOPPLER can generate sequences of trajectories that progressively satisfy the specified formulae for obstacle avoidance and sequential visitation.
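The determinantal guidance mentioned above is in the spirit of DPP-style selection, which favors batches of options that are mutually dissimilar; a simple greedy variant over candidate option embeddings is sketched below (kernel choice, seed, and selection size are illustrative assumptions, not the DOPPLER sampler).

import numpy as np

def greedy_diverse_subset(embeddings, k, length_scale=1.0):
    # embeddings: (num_candidates, dim) features of diffusion-generated options.
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    kernel = np.exp(-(dists ** 2) / (2 * length_scale ** 2))
    selected = [0]                                   # arbitrary seed candidate
    while len(selected) < k:
        best_idx, best_logdet = None, -np.inf
        for i in range(len(embeddings)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = kernel[np.ix_(idx, idx)] + 1e-6 * np.eye(len(idx))
            _, logdet = np.linalg.slogdet(sub)       # larger determinant = more diverse set
            if logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        selected.append(best_idx)
    return selected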
|
|
17:00-17:05, Paper WeET21.6 | |
PRESTO: Fast Motion Planning Using Diffusion Models Based on Key-Configuration Environment Representation |
|
Seo, Mingyo | The University of Texas at Austin |
Cho, Yoonyoung | KAIST |
Sung, Yoonchang | The University of Texas at Austin |
Stone, Peter | University of Texas at Austin |
Zhu, Yuke | The University of Texas at Austin |
Kim, Beomjoon | Korea Advanced Institute of Science and Technology |
Keywords: Motion and Path Planning, Collision Avoidance, Integrated Planning and Learning
Abstract: We introduce a learning-guided motion planning framework that generates seed trajectories using a diffusion model for trajectory optimization. Given a workspace, our method approximates the configuration space (C-space) obstacles through an environment representation consisting of a sparse set of task-related key configurations, which is then used as a conditioning input to the diffusion model. The diffusion model integrates regularization terms that encourage smooth, collision-free trajectories during training, and trajectory optimization refines the generated seed trajectories to correct any colliding segments. Our experimental results demonstrate that high-quality trajectory priors, learned through our C-space-grounded diffusion model, enable the efficient generation of collision-free trajectories in narrow-passage environments, outperforming previous learning- and planning-based baselines. Videos and additional materials can be found on the project page: https://kiwi-sherbet.github.io/PRESTO.
|
|
17:05-17:10, Paper WeET21.7 | |
Demonstration Data-Driven Parameter Adjustment for Trajectory Planning in Highly Constrained Environments |
|
Lu, Wangtao | Zhejiang University |
Chen, Lei | Beijing Institute of Spacecraft System Engineering |
Wang, Yunkai | Zhejiang University |
Wei, Yufei | Zhejiang University |
Wu, Zifei | Zhejiang University |
Xiong, Rong | Zhejiang University |
Wang, Yue | Zhejiang University |
Keywords: Motion and Path Planning, Learning from Demonstration
Abstract: Trajectory planning in highly constrained environments is crucial for robotic navigation. Classical algorithms are widely used for their interpretability, generalization, and system robustness. However, these algorithms often require parameter retuning when adapting to new scenarios. To address this issue, we propose a demonstration data-driven reinforcement learning (RL) method for automatic parameter adjustment. Our approach includes two main components: a front-end policy network and a back-end asynchronous controller. The policy network selects appropriate parameters for the trajectory planner, while a discriminator in a Conditional Generative Adversarial Network (CGAN) evaluates the planned trajectory, using this evaluation as an imitation reward in RL. The asynchronous controller is employed for high-frequency trajectory tracking. Experiments conducted in both simulation and the real world demonstrate that our proposed method significantly enhances the performance of classical algorithms.
|
|
17:10-17:15, Paper WeET21.8 | |
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring Using Vision-Language Models |
|
Song, Daeun | George Mason University |
Liang, Jing | University of Maryland |
Payandeh, Amirreza | George Mason University |
Raj, Amir Hossain | George Mason University |
Xiao, Xuesu | George Mason University |
Manocha, Dinesh | University of Maryland |
Keywords: Motion and Path Planning, Task and Motion Planning, Integrated Planning and Control
Abstract: We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term that ensures socially appropriate and effective robot actions generated by the underlying planner. Our overall approach reduces reliance on large training datasets and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least 27.38% improvement in the average success rate and 19.05% improvement in the average collision rate in the four social navigation scenarios. Our user study score shows that VLM-Social-Nav generates the most socially compliant navigation behavior.
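The VLM-based scoring module described above contributes an extra cost term to the underlying planner; a minimal sketch of that composition is given below, where vlm_score in [0, 1] and the weight lambda_social are illustrative placeholders rather than the paper's exact formulation.

def total_action_cost(base_cost, vlm_score, lambda_social=1.0):
    # base_cost:  geometric/kinodynamic cost of a candidate robot action.
    # vlm_score:  social-compliance score in [0, 1] parsed from the VLM's response.
    social_penalty = 1.0 - vlm_score
    return base_cost + lambda_social * social_penalty

# The planner executes the candidate with the lowest combined cost, so socially
# non-compliant actions are penalized without retraining the planner itself.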
|
|
WeET22 |
411 |
Deep Learning for Visual Perception 3 |
Regular Session |
Co-Chair: Hsu, Winston | National Taiwan University |
|
16:35-16:40, Paper WeET22.1 | |
Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension |
|
Guan, Runwei | University of Liverpool |
Zhang, Ruixiao | University of Southampton |
Ouyang, Ningwei | University of Liverpool |
Liu, Jianan | Momoni AI |
Man, Ka Lok | Xi'an Jiaotong-Liverpool University |
Cai, Xiaohao | University of Southampton |
Xu, Ming | Xi'an Jiaotong-Liverpool University |
Smith, Jeremy S. | University of Liverpool |
Lim, Eng Gee | Xi'an Jiaotong-Liverpool University |
Yue, Yutao | Hong Kong University of Science and Technology (Guangzhou) |
Xiong, Hui | The Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Deep Learning for Visual Perception, Data Sets for Robotic Vision, Intelligent Transportation Systems
Abstract: Embodied perception is essential for intelligent vehicles and robots in interactive environmental understanding. However, these advancements primarily focus on vision, with limited attention given to using 3D modeling sensors, restricting a comprehensive understanding of objects in response to prompts containing qualitative and quantitative queries. Recently, as a promising automotive sensor with affordable cost, 4D millimeter-wave radars provide denser point clouds than conventional radars and perceive both semantic and physical characteristics of objects, thereby enhancing the reliability of perception systems. To foster the development of natural language-driven context understanding in radar scenes for 3D visual grounding, we construct the first dataset, Talk2Radar, which bridges these two modalities for 3D Referring Expression Comprehension (REC). Talk2Radar contains 8,682 referring prompt samples with 20,558 referred objects. Moreover, we propose a novel model, T-RadarNet, for 3D REC on point clouds, achieving State-Of-The-Art (SOTA) performance on the Talk2Radar dataset compared to counterparts. Deformable-FPN and Gated Graph Fusion are meticulously designed for efficient point cloud feature modeling and cross-modal fusion between radar and text features, respectively. Comprehensive experiments provide deep insights into radar-based 3D REC. We release our project at https://github.com/GuanRunwei/Talk2Radar.
|
|
16:40-16:45, Paper WeET22.2 | |
Improving Generalization Ability for 3D Object Detection by Learning Sparsity-Invariant Features |
|
Lu, Hsin-Cheng | National Taiwan University |
Lin, Chungyi | National Taiwan University |
Hsu, Winston | National Taiwan University |
Keywords: Deep Learning for Visual Perception, Object Detection, Segmentation and Categorization, Visual Learning
Abstract: In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite the continuous development of various technologies for this task, a significant drawback is observed in most of them—they experience substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability for 3D object detection on a single domain. We primarily focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam, using confidence scores determined by the current detector to identify the density that holds utmost importance for the detector. Subsequently, we employ the teacher-student framework to align the Bird's Eye View (BEV) features for different point clouds densities. We also utilize feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to instruct the detector to be domain-agnostic. Extensive experiments demonstrate that our method exhibits superior generalization capabilities compared to other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that can access to the target domain data. The code is available at https://github.com/Tiffamy/3DOD-LSF.
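The beam subsampling step described above can be approximated by binning LiDAR points by elevation angle and keeping a subset of beams; the helper below is a simplified illustration (the confidence-score-driven choice of target density is omitted, and bin handling is an assumption).

import numpy as np

def subsample_beams(points, source_beams=64, target_beams=32):
    # points: (N, 3+) LiDAR points with x, y, z in the first three columns.
    elevation = np.arctan2(points[:, 2], np.linalg.norm(points[:, :2], axis=1))
    edges = np.linspace(elevation.min(), elevation.max(), source_beams + 1)
    beam_id = np.clip(np.digitize(elevation, edges) - 1, 0, source_beams - 1)
    keep_every = source_beams // target_beams
    return points[beam_id % keep_every == 0]   # e.g. keep every other beam for 64 -> 32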
|
|
16:45-16:50, Paper WeET22.3 | |
Camera-Lidar Consistent Neural Radiance Fields |
|
Hou, Chao | The University of Hong Kong |
Zhang, Fu | University of Hong Kong |
Keywords: Deep Learning for Visual Perception, Computer Vision for Automation, Sensor Fusion
Abstract: Neural Radiance Fields (NeRFs) have become a leading technique for novel view synthesis, with promising applications in robotics. However, due to shape-radiance ambiguity, NeRFs often require additional depth inputs for regularization in outdoor scenarios. LiDAR provides accurate depth measurements, but current methods typically combine only a few frames, resulting in sparse depth maps and discrepancies with camera images. The asynchronous nature of LiDAR, where each point is captured at a different timestamp, introduces depth inaccuracies when the points are treated as simultaneous. These errors, along with inherent LiDAR noise, create inconsistencies that hinder reconstruction accuracy. To address these challenges, we propose a continuous-time framework for joint Camera-LiDAR optimization, enabling more consistent radiance field reconstruction and improving both view synthesis and geometric accuracy.
|
|
16:50-16:55, Paper WeET22.4 | |
Iterative Volume Fusion for Asymmetric Stereo Matching |
|
Gao, Yuanting | Tsinghua University |
Shen, Linghao | Sony (China) Ltd |
Keywords: Deep Learning for Visual Perception, Computer Vision for Transportation, AI-Based Methods
Abstract: Stereo matching is vital in 3D computer vision, with most algorithms assuming symmetric visual properties between binocular visions. However, the rise of asymmetric multi-camera systems (e.g., tele-wide cameras) challenges this assumption and complicates stereo matching. Visual asymmetry disrupts stereo matching by affecting the crucial cost volume computation. To address this, we explore the matching cost distribution of two established cost volume construction methods in asymmetric stereo. We find that each cost volume experiences distinct information distortion, indicating that both should be comprehensively utilized to solve the issue. Based on this, we propose the two-phase Iterative Volume Fusion network for Asymmetric Stereo matching (IVF-AStereo). Initially, the aggregated concatenation volume refines the correlation volume. Subsequently, both volumes are fused to enhance fine details. Our method excels in asymmetric scenarios and shows robust performance against significant visual asymmetry. Extensive comparative experiments on benchmark datasets, along with ablation studies, confirm the effectiveness of our approach in asymmetric stereo with resolution and color degradation.
|
|
16:55-17:00, Paper WeET22.5 | |
OccRWKV: Rethinking Efficient 3D Semantic Occupancy Prediction with Linear Complexity |
|
Wang, Junming | The University of Hong Kong |
Yin, Wei | University of Adelaide |
Long, Xiaoxiao | The University of Hong Kong |
Zhang, Xingyu | Horizon Robotics |
Xing, Zebin | UCAS |
Guo, Xiaoyang | Horizon Robotics |
Zhang, Qian | Horizon Robotics |
Keywords: Deep Learning for Visual Perception, Visual Learning, Computer Vision for Automation
Abstract: 3D semantic occupancy prediction networks have demonstrated remarkable capabilities in reconstructing the geometric and semantic structure of 3D scenes, providing crucial information for robot navigation and autonomous driving systems. However, due to their large overhead from dense network structure designs, existing networks face challenges balancing accuracy and latency. In this paper, we introduce OccRWKV, an efficient semantic occupancy network inspired by Receptance Weighted Key Value (RWKV). OccRWKV separates semantics, occupancy prediction, and feature fusion into distinct branches, each incorporating Sem-RWKV and Geo-RWKV blocks. These blocks are designed to capture long-range dependencies, enabling the network to learn domain-specific representation (i.e., semantics and geometry), which enhances prediction accuracy. Leveraging the sparse nature of real-world 3D occupancy, we reduce computational overhead by projecting features into the bird's-eye view (BEV) space and propose a BEV-RWKV block for efficient feature enhancement and fusion. This enables real-time inference at 22.2 FPS without compromising performance. Experiments demonstrate that OccRWKV outperforms the state-of-the-art methods on the SemanticKITTI dataset, achieving a mIoU of 25.1 while being 20 times faster than the best baseline, Co-Occ, making it suitable for real-time deployment on robots to enhance autonomous navigation efficiency. Code and video are available on our project page: https://jmwang0117.github.io/OccRWKV/.
|
|
17:00-17:05, Paper WeET22.6 | |
ZSORN: Language-Driven Object-Centric Zero-Shot Object Retrieval and Navigation |
|
Guan, Tianrui | University of Maryland |
Yang, Yurou | Amazon |
Cheng, Harry | Amazon |
Lin, Muyuan | Amazon.com LLC |
Kim, Richard | Amazon, Lab126 |
Madhivanan, Rajasimman | Amazon.com |
Sen, Arnab | Amazon |
Manocha, Dinesh | University of Maryland |
Keywords: Deep Learning for Visual Perception, Vision-Based Navigation
Abstract: In this paper, we present ZSORN, a novel language-driven object-centric image representation for object retrieval and navigation tasks within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with 5% and 16.67% improvements in navigation success rate, respectively.
|
|
17:05-17:10, Paper WeET22.7 | |
PRIDEV: A Plug-And-Play Refinement for Improved Depth Estimation in Videos |
|
Xu, Jing | Peking University |
Liu, Hong | Peking University |
Wu, Jianbing | Peking University |
Xu, Xinhua | Peking University |
Keywords: RGB-D Perception, Deep Learning for Visual Perception
Abstract: Monocular video depth estimation is a key challenge in computer vision, highlighting its importance in visual understanding. Monocular depth estimation models trained on single images achieve impressive results on individual frames but often lack temporal consistency when applied to videos, leading to flickering and artifacts. Current video depth estimation methods often rely on additional optical flow or camera poses, which are limited in accuracy, require careful design, and lack robustness. To address this, we propose a plug-and-play method that seamlessly transfers the robustness of image depth estimation to video depth estimation. By leveraging powerful priors from image depth estimation, our method enhances the performance of video depth estimation without requiring additional conditional inputs or extensive pretraining on large and expensive video datasets. We introduce the Temporal Depth Stabilization Module (TDSM), which can seamlessly inflate an image monocular depth estimation model into a video depth estimation model, enabling unified modeling of depth across video sequences and capturing the temporal cues in video. We validate the effectiveness and efficiency of our method across various datasets (e.g., normal and challenging conditions) and different backbones. Extensive experiments demonstrate that our simple and effective method significantly improves monocular depth estimation networks, achieving new state-of-the-art accuracy in both spatial and temporal dimensions.
|
|
WeET23 |
412 |
Learning for Control |
Regular Session |
Chair: Manchester, Zachary | Carnegie Mellon University |
Co-Chair: Beckers, Thomas | Vanderbilt University |
|
16:35-16:40, Paper WeET23.1 | |
Unsupervised Meta-Testing with Conditional Neural Processes for Hybrid Meta-Reinforcement Learning |
|
Ada, Suzan Ece | Bogazici University |
Ugur, Emre | Bogazici University |
Keywords: Reinforcement Learning, Deep Learning Methods, Machine Learning for Robot Control
Abstract: We introduce Unsupervised Meta-Testing with Conditional Neural Processes (UMCNP), a novel hybrid few-shot meta-reinforcement learning (meta-RL) method that uniquely combines, yet distinctly separates, parameterized policy gradient-based (PPG) and task inference-based few-shot meta-RL. Tailored for settings where the reward signal is missing during meta-testing, our method increases sample efficiency without requiring additional samples in meta-training. UMCNP leverages the efficiency and scalability of Conditional Neural Processes (CNPs) to reduce the number of online interactions required in meta-testing. During meta-training, samples previously collected through PPG meta-RL are efficiently reused for learning task inference in an offline manner. UMCNP infers the latent representation of the transition dynamics model from a single test task rollout with unknown parameters. This approach allows us to generate rollouts for self-adaptation by interacting with the learned dynamics model. We demonstrate our method can adapt to an unseen test task using significantly fewer samples during meta-testing than the baselines in 2D-Point Agent and continuous control meta-RL benchmarks, namely, cartpole with unknown angle sensor bias, walker agent with randomized dynamics parameters.
|
|
16:40-16:45, Paper WeET23.2 | |
Efficient Online Learning of Contact Force Models for Connector Insertion |
|
Tracy, Kevin | Carnegie Mellon University |
Manchester, Zachary | Carnegie Mellon University |
Jain, Ajinkya | Intrinsic Innovation LLC |
Go, Keegan | Intrinsic Innovation LLC |
Schaal, Stefan | Google X |
Erez, Tom | Google |
Tassa, Yuval | University of Washington |
Keywords: Model Learning for Control, Calibration and Identification, Dexterous Manipulation
Abstract: Contact-rich manipulation tasks with stiff frictional elements, like connector insertion, are difficult to model with rigid-body simulators. In this work, we propose a new approach for modeling these environments by learning a quasi-static contact force model instead of a full simulator. Using a feature vector that contains information about the configuration and control, we find a linear mapping adequately captures the relationship between this feature vector and the sensed contact forces. A novel Linear Model Learning (LML) algorithm is used to solve for the globally optimal mapping in real time without any matrix inversions, resulting in an algorithm that runs in nearly constant time on a GPU as the model size increases. We validate the proposed approach for connector insertion in both simulation and hardware experiments, where the learned model is combined with an optimization-based impedance controller to achieve smooth insertions in the presence of misalignments and uncertainty. Our website featuring videos, code, and more materials is available at https://model-based-plugging.github.io/.
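As a rough picture of the quasi-static force model described above, the snippet below fits a linear map from a configuration/control feature vector to the sensed contact wrench with an inversion-free online update; it uses plain stochastic gradient descent for illustration and is not the paper's LML algorithm, which solves for the globally optimal mapping.

import numpy as np

class OnlineLinearForceModel:
    # Predicts contact wrench w ~ A @ phi and refines A from streamed measurements.

    def __init__(self, feature_dim, force_dim, lr=1e-2):
        self.A = np.zeros((force_dim, feature_dim))
        self.lr = lr

    def predict(self, features):
        return self.A @ features

    def update(self, features, measured_force):
        # Single gradient step on the squared prediction error; no matrix inversion.
        error = self.predict(features) - measured_force
        self.A -= self.lr * np.outer(error, features)
        return float(error @ error)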
|
|
16:45-16:50, Paper WeET23.3 | |
Flying Quadrotors in Tight Formations Using Learning-Based Model Predictive Control |
|
Chee, Kong Yao | University of Pennsylvania |
Hsieh, Pei-An | University of Pennsylvania |
Pappas, George J. | University of Pennsylvania |
Hsieh, M. Ani | University of Pennsylvania |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Aerial Systems: Mechanics and Control
Abstract: Flying quadrotors in tight formations is a challenging problem. It is known that in the near-field airflow of a quadrotor, the aerodynamic effects induced by the propellers are complex and difficult to characterize. Although machine learning tools can potentially be used to derive models that capture these effects, these data-driven approaches can be sample inefficient and the resulting models often do not generalize as well as their first-principles counterparts. In this work, we propose a framework that combines the benefits of first-principles modeling and data-driven approaches to construct an accurate and sample efficient representation of the complex aerodynamic effects resulting from quadrotors flying in formation. The data-driven component within our model is lightweight, making it amenable for optimization-based control design. Through simulations and physical experiments, we show that incorporating the model into a novel learning-based nonlinear model predictive control (MPC) framework results in substantial performance improvements in terms of trajectory tracking and disturbance rejection. In particular, our framework significantly outperforms nominal MPC in physical experiments, achieving a 40.1% improvement in the average trajectory tracking errors and a 57.5% reduction in the maximum vertical separation errors. Our framework also achieves exceptional sample efficiency, using only a total of 46 seconds of flight data for training across both simulations and physical experiments. Furthermore, with our proposed framework, the quadrotors achieve an exceptionally tight formation, flying with an average separation of less than 1.5 body lengths throughout the flight.
|
|
16:50-16:55, Paper WeET23.4 | |
Learning Based MPC for Autonomous Driving Using a Low Dimensional Residual Model |
|
Li, Yaoyu | Tsinghua University |
Huang, Chaosheng | Tsinghua University |
Yang, Dongsheng | BYD Automotive New Technology Research Institute |
Liu, Wenbo | Tsinghua University |
Li, Jun | Tsinghua University |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Motion Control
Abstract: In this paper, a learning-based Model Predictive Control (MPC) approach using a low-dimensional residual model is proposed for autonomous driving. One of the critical challenges in autonomous driving is the complexity of vehicle dynamics, which impedes the formulation of an accurate vehicle model. An inaccurate vehicle model can significantly impact the performance of the MPC controller. To address this issue, this paper decomposes the nominal vehicle model into invariable and variable elements. The accuracy of the invariable elements is ensured by calibration, while the deviations in the variable elements are learned by a low-dimensional residual model. The features of the residual model are selected as the physical variables most correlated with nominal model errors. Physical constraints among these features are formulated to explicitly define the valid region within the feature space. The formulated model and constraints are incorporated into the MPC framework and validated through both simulation and real vehicle experiments. The results indicate that the proposed method significantly enhances model accuracy and controller performance.
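The decomposition described above amounts to a calibrated nominal prediction plus a small learned correction computed from a few physically meaningful features; the sketch below is schematic, with placeholder feature selection and constraint handling rather than the paper's actual model.

import numpy as np

def predict_next_state(nominal_model, residual_model, state, control):
    # Calibrated nominal dynamics plus a low-dimensional learned residual correction.
    x_nominal = nominal_model(state, control)
    features = residual_features(state, control)
    return x_nominal + residual_model(features)

def residual_features(state, control):
    # Placeholder: the physical variables most correlated with nominal-model error,
    # clipped to the region of the feature space where the residual model is valid.
    feats = np.array([state[2], state[3], control[0]])
    return np.clip(feats, -1.0, 1.0)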
|
|
16:55-17:00, Paper WeET23.5 | |
Modeling of Deformable Linear Objects under Incomplete State Information |
|
Klankers, Marc Kilian | Technische Universität Braunschweig |
Steil, Jochen J. | Technische Universität Braunschweig |
Keywords: Model Learning for Control, Machine Learning for Robot Control, Modeling, Control, and Learning for Soft Robots
Abstract: The robot-based tracking of highly dynamic end-point motions of deformable linear objects (DLOs) remains challenging due to their non-linear behavior. Since simple feedback control is infeasible, model-based control offers the potential to account for the non-linear effects, but requires computationally efficient and accurate models. Promising results have been achieved with data-driven models that introduce a latent kinematic chain as a model of the DLO and map measurements of the tip position into its latent joint space, in which the dynamic motion model is learned. So far, this approach has the limitation that it cannot handle situations of incomplete sensory information, for instance when occlusion occurs. Consequently, this paper introduces a fusion network architecture capable of making predictions even if sensory information is incomplete. We achieve additional estimation of the latent joint state by learning data-driven inverse kinematics with the help of wrench measurements at the DLO base, and evaluate our approach by simulating occlusion. We demonstrate the computational effectiveness of our approach for in-the-loop control tasks.
|
|
17:00-17:05, Paper WeET23.6 | |
Impedance Primitive-Augmented Hierarchical Reinforcement Learning for Sequential Tasks |
|
Berjaoui Tahmaz, Amin | TU Delft |
Prakash, Ravi | Indian Institute of Science |
Kober, Jens | TU Delft |
Keywords: Reinforcement Learning, Compliance and Impedance Control, Task and Motion Planning
Abstract: This paper presents an Impedance Primitive-augmented hierarchical reinforcement learning framework for efficient robotic manipulation in sequential contact tasks. We leverage this hierarchical structure to sequentially execute behavior primitives with variable stiffness control capabilities for contact tasks. Our proposed approach relies on three key components: an action space enabling variable stiffness control, an adaptive stiffness controller for dynamic stiffness adjustments during primitive execution, and affordance coupling for efficient exploration while encouraging compliance. Through comprehensive training and evaluation, our framework learns efficient stiffness control capabilities and demonstrates improvements in learning efficiency, compositionality in primitive selection, and success rates compared to the state-of-the-art. The training environments include block lifting, door opening, object pushing, and surface cleaning. Real world evaluations further confirm the framework's sim2real capability. This work lays the foundation for more adaptive and versatile robotic manipulation systems, with potential applications in more complex contact-based tasks.
|
|
17:05-17:10, Paper WeET23.7 | |
Plug-And-Play Physics-Informed Learning Using Uncertainty Quantified Port-Hamiltonian Models |
|
Tan, Kaiyuan | Washington University in St.Louis |
Li, Peilun | Vanderbilt University |
Wang, Jun | Washington University in St. Louis |
Beckers, Thomas | Vanderbilt University |
Keywords: Model Learning for Control, AI-Based Methods, Calibration and Identification
Abstract: The ability to predict trajectories of surrounding agents and obstacles is a crucial component in many robotic applications. Data-driven approaches are commonly adopted for state prediction in scenarios where the underlying dynamics are unknown. However, the performance, reliability, and uncertainty of data-driven predictors become compromised when encountering out-of-distribution observations relative to the training data. In this paper, we introduce a Plug-and-Play Physics-Informed Machine Learning (PnP-PIML) framework to address this challenge. Our method employs conformal prediction to identify outlier dynamics and, in that case, switches from a nominal predictor to a physics-consistent model, namely distributed Port-Hamiltonian systems (dPHS). We leverage Gaussian processes to model the energy function of the dPHS, enabling not only the learning of system dynamics but also the quantification of predictive uncertainty through its Bayesian nature. In this way, the proposed framework produces reliable physics-informed predictions even for the out-of-distribution scenarios.
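The switching logic described above can be summarized as: calibrate a conformal threshold on held-out residuals of the nominal data-driven predictor, then fall back to the physics-consistent model whenever a new residual exceeds it. The sketch below simplifies the conformal score and the Gaussian-process dPHS model to opaque callables and is illustrative only.

import numpy as np

def conformal_threshold(calibration_residuals, alpha=0.1):
    # Finite-sample-adjusted (1 - alpha) quantile of the nominal predictor's residuals.
    n = len(calibration_residuals)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(calibration_residuals, level)

def predict_with_fallback(nominal_predictor, physics_predictor, observation,
                          recent_residual, threshold):
    # In-distribution: trust the data-driven model; otherwise use the physics-informed one.
    if recent_residual > threshold:
        return physics_predictor(observation)
    return nominal_predictor(observation)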
|
|
17:10-17:15, Paper WeET23.8 | |
Robust Proximal Adversarial Reinforcement Learning under Model Mismatch |
|
Zhai, Peng | Fudan University |
Wei, Xiaoyi | Fudan University |
Hou, Taixian | FuDan University |
Ji, Xiaopeng | Zhejiang University |
Dong, Zhiyan | Fudan University |
Yi, Jiafu | Hainan University |
Zhang, Lihua | Fudan University |
Keywords: Reinforcement Learning, Robust/Adaptive Control
Abstract: Reinforcement learning (RL) can generate high-performance control policies for complex tasks in simulation through an end-to-end approach. However, the RL policy is not robust to uncertainties caused by modeling mismatch between simulation and real environments, making it difficult to transfer to the real world. In response to the above challenge, this letter introduces a lightweight and efficient robust RL algorithm. The algorithm transforms the optimization objective of the adversary from a long-term cumulative reward to a short-term reward, making the adversary focus on the performance in the near future. Additionally, the adversarial actions are projected onto a finite subset within the perturbation space using projected gradient descent, effectively constraining the adversary's strength and obtaining more robust policies. Extensive experiments in both simulated and real environments show that our algorithm improves the generalization ability of the policy for the modeling mismatch, outperforming the next best prior methods across almost all environments.
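The projection step described above constrains the adversary with projected gradient ascent on a short-horizon objective, keeping perturbations inside a small bounded set; the snippet below is a generic sketch with illustrative bound, step size, and iteration count, not the paper's exact procedure.

import torch

def bounded_adversarial_perturbation(short_term_loss, state, epsilon=0.05,
                                     step_size=0.01, num_steps=5):
    # short_term_loss(state) is differentiable and grows as near-term performance degrades.
    delta = torch.zeros_like(state, requires_grad=True)
    for _ in range(num_steps):
        loss = short_term_loss(state + delta)
        loss.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()   # ascend the short-term loss
            delta.clamp_(-epsilon, epsilon)          # project back onto the bounded set
            delta.grad.zero_()
    return delta.detach()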
|
| |