Multimodal Language Understanding for Domestic Service Robots
Context
The current rise in life expectancy has increasingly emphasized the need for daily care and support. Robots that can physically assist people with disabilities offer a way to overcome the shortage of home care workers. This context has boosted the need for standardized domestic service robots (DSRs) that can provide the necessary support functions. However, one of the main limitations of DSRs is their inability to interact naturally through language. Specifically, most DSRs do not allow users to instruct them with the varied expressions that arise in daily care tasks. Overcoming this limitation would give non-expert users a user-friendly way to interact with DSRs.
Challenges
Natural language instructions related to daily care tasks raise several challenges tied to their ambiguity; i.e., the relevant information may be truncated, hidden, or expressed in several ways. The many-to-many nature of the mapping between language and the real world makes it difficult to accurately predict user intention, so the user's intent must be inferred from only partial information. Grounding a user's intention requires not only linguistic input but also perceptual input (e.g., vision) and contextual knowledge.
Data-driven approaches are effective for handling ambiguous instructions. Unfortunately, these methods require large annotated datasets that are tedious to create manually, mainly because of the time required for human experts to provide sentences for images. As a result, sophisticated deep neural networks are not fully exploited in robotics today: most robotic datasets are small and custom-made, which easily leads to overfitting. Beyond limiting the complexity and size of networks and their ability to generalize, this is a real obstacle to the development of data-driven methods in robotics. Hence, methods that automatically augment or generate instructions could drastically reduce this cost and alleviate the labeling burden on human experts.
Robots should not only understand the operator's intent, but also be able to communicate in a comprehensible way about their actions, decisions, and intents. Indeed, bidirectional communication fosters interaction with, and trust in, robots. Paradoxically, current neural network models are ill-suited to this paradigm, as they are inherently black boxes.
Scientific Contributions
Understanding Ambiguous Commands
We developed several methods to disambiguate users' instructions during manipulation tasks.
For instance, ``Put away the milk and cereal.'' is a natural instruction in which the target area is ambiguous in a daily life environment. Conventionally, such an instruction is disambiguated through a dialogue system, but at the cost of time and user burden. Instead, we propose a multimodal language understanding (MLU) approach, in which the instruction is disambiguated from the robot state and the environment context. Deep learning techniques are used to predict task feasibility according to the HSR's physical limitations and the space available on the different target areas. We developed the MultiModal Classifier Generative Adversarial Network (MMC-GAN) to predict the likelihood of the different target areas, considering the robot's physical limitations and the clutter of each target. The candidate areas are then ranked by task feasibility, as illustrated in the left-hand figure below, for the case where the robot should place an object in a given environment. These MLU methods will be extended to map objects to the different target areas of the environment; this mapping should characterize the likelihood of an object being on a given piece of furniture or in a given area.
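To make the ranking idea concrete, the sketch below scores candidate target areas from an instruction embedding and per-area scene features, and sorts them by predicted feasibility. It is a minimal PyTorch illustration, not the actual MMC-GAN architecture; all layer sizes, feature dimensions, and names are placeholders.

```python
# Minimal sketch (PyTorch) of ranking candidate target areas from an
# instruction embedding and per-area scene features. This is NOT the
# published MMC-GAN; layer sizes, names, and features are placeholders.
import torch
import torch.nn as nn

class TargetAreaScorer(nn.Module):
    def __init__(self, lang_dim=512, scene_dim=256, hidden=256):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, hidden)
        self.scene_proj = nn.Linear(scene_dim, hidden)
        # Scores the feasibility of placing the object in one candidate area.
        self.head = nn.Sequential(
            nn.ReLU(), nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, lang_feat, area_feats):
        # lang_feat: (lang_dim,) embedding of the instruction
        # area_feats: (n_areas, scene_dim) one feature vector per candidate area
        l = self.lang_proj(lang_feat).expand(area_feats.size(0), -1)
        s = self.scene_proj(area_feats)
        return self.head(torch.cat([l, s], dim=-1)).squeeze(-1)  # (n_areas,)

scorer = TargetAreaScorer()
lang = torch.randn(512)        # e.g., embedding of "Put away the milk and cereal."
areas = torch.randn(4, 256)    # features of four candidate target areas
ranking = torch.argsort(scorer(lang, areas), descending=True)  # most feasible first
```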
The vision-and-language navigation task can be addressed similarly. For this, we developed a causal (GPT-style) Transformer that predicts a sequence of actions to reach a target destination from instructions such as: ``Go down stairs. At the bottom of the stairs walk through the living room and to the right into the bathroom. Stop at the sink.''
Three different tasks solved with multimodal language understanding. Left: predicting a suitable destination from a placing instruction. Middle: predicting the targeted object from fetching instructions. Right: navigating indoor environments with natural language instructions.
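As a rough illustration of the navigation setting above, the sketch below predicts the next discrete action with a GPT-style causal Transformer decoder conditioned on encoded instruction tokens. It is not the published CrossMap Transformer; the action set, dimensions, and instruction encoder are assumed placeholders.

```python
# Rough sketch (PyTorch) of GPT-style causal action prediction for
# vision-and-language navigation. Not the published model; the action set,
# sizes, and the instruction encoder are placeholders.
import torch
import torch.nn as nn

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]

class CausalActionDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_actions=len(ACTIONS)):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.out = nn.Linear(d_model, n_actions)

    def forward(self, instr_feats, action_history):
        # instr_feats: (B, T_lang, d_model) encoded instruction (and visual) tokens
        # action_history: (B, T_act) indices of the actions taken so far
        x = self.action_emb(action_history)
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(x, instr_feats, tgt_mask=causal)  # causal self-attention
        return self.out(h)  # (B, T_act, n_actions): logits for the next action

model = CausalActionDecoder()
instr = torch.randn(1, 12, 256)      # e.g., "Go down stairs. ... Stop at the sink."
history = torch.tensor([[0, 0, 1]])  # forward, forward, turn_left
next_action = ACTIONS[model(instr, history)[0, -1].argmax().item()]
```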
Data Augmentation for Multimodal Language Understanding
One of the central issues in MLU for DSRs is the lack of large-scale datasets (although, with the strong interest of the GAFA companies in these tasks, more and more datasets are now being released).
We proposed several approaches to address this challenge. The first performs data augmentation in the latent space. Based on generative adversarial networks (GANs), we introduced simultaneous data augmentation and classification for multimodal language understanding in manipulation tasks. A generator network mimics the distribution of the actual latent-space features, and classification is then performed on both the actual and generated latent features. This work showed that augmenting data in the latent space can be more efficient than augmenting it in the raw input space.
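The snippet below sketches this idea under simplifying assumptions: a generator maps noise to synthetic latent features, and a classifier with an extra ``generated'' class is trained on both real and synthetic latents, while the generator tries to have its samples assigned to real classes. It illustrates the training signal only, not the published model.

```python
# Minimal sketch (PyTorch) of latent-space data augmentation: a generator
# produces synthetic latent features and a classifier is trained on both
# real and generated latents. Illustrative only; not the published model.
import torch
import torch.nn as nn

latent_dim, noise_dim, n_classes = 128, 32, 10

generator = nn.Sequential(     # noise -> synthetic latent feature
    nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
classifier = nn.Sequential(    # latent feature -> logits (real classes + "generated")
    nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_classes + 1))
ce = nn.CrossEntropyLoss()

real_latents = torch.randn(16, latent_dim)       # features from a frozen multimodal encoder
real_labels = torch.randint(0, n_classes, (16,))
fake_latents = generator(torch.randn(16, noise_dim))
fake_labels = torch.full((16,), n_classes)       # the extra "generated" class

# Classifier: label real latents correctly and detect generated ones.
loss_cls = ce(classifier(real_latents), real_labels) + \
           ce(classifier(fake_latents.detach()), fake_labels)
# Generator: fool the classifier into assigning real class labels to its samples.
loss_gen = ce(classifier(fake_latents), real_labels)
```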
The second approach is an automatic image-annotation engine based on image captioning, as illustrated in the right-hand figure below. Such an engine is very useful in competitions such as RoboCup@Home or WRS to automate the collection of fetching instructions. The approach generates a sentence describing a target object using an LSTM network.
Generating fetching instructions on the fly while collecting data.
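The sketch below shows the generation step in its simplest form: an LSTM decoder, initialized from target-object features, greedily emits one word at a time. The vocabulary and dimensions are placeholders, and the attention branches of the published encoder-decoder are omitted.

```python
# Bare-bones sketch (PyTorch) of generating a fetching instruction from
# target-object features with an LSTM decoder. The attention branches of the
# published encoder-decoder are omitted; vocabulary and sizes are placeholders.
import torch
import torch.nn as nn

vocab = ["<bos>", "<eos>", "bring", "me", "the", "bottle", "on", "table"]
vocab_size, feat_dim, hidden = len(vocab), 512, 256

img_proj = nn.Linear(feat_dim, hidden)   # target-object features -> initial hidden state
word_emb = nn.Embedding(vocab_size, hidden)
lstm = nn.LSTMCell(hidden, hidden)
out = nn.Linear(hidden, vocab_size)

def generate(obj_feat, max_len=10):
    """Greedily decode one sentence from a target-object feature vector."""
    h = torch.tanh(img_proj(obj_feat)).unsqueeze(0)   # (1, hidden)
    c = torch.zeros_like(h)
    token = torch.tensor([vocab.index("<bos>")])
    words = []
    for _ in range(max_len):
        h, c = lstm(word_emb(token), (h, c))
        token = out(h).argmax(dim=-1)
        if vocab[token.item()] == "<eos>":
            break
        words.append(vocab[token.item()])
    return " ".join(words)

print(generate(torch.randn(512)))   # untrained, so the output is arbitrary
```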
Interpretability of MLU Models
In the context of DSRs supporting elderly or disabled people, safety is one of the most important requirements. To allow users to understand and validate MLU models beforehand, it is crucial to be able to interpret them. Inspired by the attention models developed in computer vision, several of our works include attention mechanisms to interpret classification or generation results.
Based on attention branch networks, a first work focused on predicting damaging collisions for DSRs when placing objects at a target destination. The proposed DNN model, PonNet, classifies and predicts beforehand the likelihood of a damaging collision. An attention map can also be extracted from PonNet to ``explain'' where the likely collision would occur, from RGBD data only.
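The sketch below illustrates the general attention-branch pattern assumed here (not the actual PonNet): a convolutional branch produces a spatial attention map that both weights the features used for the collision prediction and can be upsampled and overlaid on the RGBD input as a visual explanation.

```python
# Simplified sketch (PyTorch) of an attention-branch-style head: a conv
# branch yields a spatial attention map used to weight the features and to
# visualize where a collision is predicted. Not the actual PonNet.
import torch
import torch.nn as nn

class AttentionBranchHead(nn.Module):
    def __init__(self, in_ch=64, n_classes=2):   # e.g., {no collision, collision}
        super().__init__()
        self.att_conv = nn.Conv2d(in_ch, 1, kernel_size=1)            # attention branch
        self.cls = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(in_ch, n_classes))          # perception branch

    def forward(self, feat):
        # feat: (B, C, H, W) feature map from an RGBD backbone
        att = torch.sigmoid(self.att_conv(feat))   # (B, 1, H, W) attention map
        logits = self.cls(feat * att)              # attention-weighted prediction
        return logits, att                         # att can be upsampled and overlaid

head = AttentionBranchHead()
logits, att_map = head(torch.randn(1, 64, 16, 16))   # att_map serves as the explanation
```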
With a similar approach, attention mechanisms for fetching tasks or sentence generation can be developed to improve and explain the obtained results.
Using visual attention mechanisms for prediction and text-generation explainability. Left: predicting collisions before placing objects, with visual attention focused on the likely collision. Middle: sentence generation with linguistic attention. Right: fetching instructed objects with visual attention.
Related Publications
Aly Magassouba, Komei Sugiura, Angelica Nakayama, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, and Hisashi Kawai. Predicting and attending to damaging collisions for placing everyday objects in photo-realistic simulations. Advanced Robotics, volume 35, pages 787–799. Taylor & Francis, 2021.
Aly Magassouba, Komei Sugiura, and Hisashi Kawai. Crossmap transformer: A crossmodal masked path transformer using double back-translation for vision-and-language navigation. IEEE Robotics and Automation Letters, volume 6, pages 6258–6265. IEEE, 2021.
Tadashi Ogura, Aly Magassouba, Komei Sugiura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, and Hisashi Kawai. Alleviating the burden of labeling: Sentence generation by attention branch encoder–decoder network. IEEE Robotics and Automation Letters, volume 5, pages 5945–5952. IEEE, 2020.
Aly Magassouba, Komei Sugiura, and Hisashi Kawai. Multimodal attention branch network for perspective-free sentence generation. In Conference on Robot Learning, pages 76–85. PMLR, 2020.
Aly Magassouba, Komei Sugiura, and Hisashi Kawai. A multimodal target-source classifier with attention branches to understand ambiguous instructions for fetching daily objects. IEEE Robotics and Automation Letters, volume 5, pages 532–539. IEEE, 2020.
Aly Magassouba, Komei Sugiura, and Hisashi Kawai. Latent-space data augmentation for visually-grounded language understanding. In Advances in Artificial Intelligence: Selected Papers from the Annual Conference of Japanese Society of Artificial Intelligence 33, pages 179–187. Springer International Publishing, 2020.
Aly Magassouba, Komei Sugiura, Anh Trinh Quoc, and Hisashi Kawai. Understanding natural language instructions for fetching daily objects using GAN-based multimodal target–source classification. IEEE Robotics and Automation Letters, volume 4, pages 3884–3891. IEEE, 2019.
Aly Magassouba, Komei Sugiura, and Hisashi Kawai. A multimodal target-source classifier model for object fetching from natural language instructions. In Proceedings of the Annual Conference of JSAI 33rd (2019), pages 2D3E403–2D3E403. The Japanese Society for Artificial Intelligence, 2019.
Aly Magassouba, Komei Sugiura, and Hisashi Kawai. A multimodal classifier generative adversarial network for carry and place tasks from ambiguous language instructions. IEEE Robotics and Automation Letters, volume 3, pages 3113–3120. IEEE, 2018.