IROS 2015 Paper Abstract


Paper ThAT2.1

Kojima, Ryosuke (Tokyo Institute of Technology), Sugiyama, Osamu (Tokyo Institute of Technology), Nakadai, Kazuhiro (Honda Research Inst. Japan Co., Ltd.)

Audio-Visual Scene Understanding Utilizing Text Information for a Cooking Support Robot

Scheduled for presentation during the Regular session "Smart Robotics Application 1" (ThAT2), Thursday, October 1, 2015, 08:30−08:45, Saal D

2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sept 28 - Oct 03, 2015, Congress Center Hamburg, Hamburg, Germany

This information is tentative and subject to change. Compiled on July 19, 2019

Keywords: Robot Audition, Recognition


This paper addresses multimodal "scene understanding" for a robot using audio-visual and text information. Scene understanding is defined as extracting six-W information (What, When, Where, Who, Why, and hoW) about the surrounding environment. Although scene understanding for robots has been studied in the fields of robot vision and robot audition, only the first four Ws have been considered; why and how information has not. We therefore focus on extracting how information, in particular in cooking scenes. In a cooking scene, we define how information as the cooking procedure, which is useful for a robot that gives appropriate cooking advice. To realize such cooking support, we propose a multimodal cooking-procedure recognition framework consisting of a Convolutional Neural Network (CNN) and a Hierarchical Hidden Markov Model (HHMM). The CNN, known as one of the most advanced classifiers, is applied to recognize cooking events from audio and visual information. The HHMM models a cooking procedure as a sequence of cooking events, defined through relationships between cooking events derived from text data obtained from the web and the cooking events classified by the CNN. Our proposed framework thus integrates these three modalities. We constructed an interactive cooking support system based on the proposed framework, which advises on the next step of the current cooking procedure through human-robot communication. Preliminary results with simulated and real recorded multimodal scenes showed the robustness of the proposed framework in noisy and/or occluded situations.



Technical Content © IEEE Robotics & Automation Society
