Multimodal Language Understanding for Domestic Service Robots


Rising life expectancy has increased the need for daily care and support. Robots that can physically assist people with disabilities offer one way to address the shortage of home care workers. This context has boosted the demand for standardized domestic service robots (DSRs) that can provide the necessary support functions. However, one of the main limitations of DSRs is their inability to interact naturally through language. Specifically, most DSRs do not allow users to instruct them with varied expressions relating to daily care tasks. Overcoming this limitation would give non-expert users a user-friendly way to interact with DSRs.



Scientific Contributions

Understanding Ambiguous Commands

We developed several methods to disambiguate users' instructions during manipulation tasks.

For instance, "Put away the milk and cereal." is a natural instruction in which the target area is ambiguous in everyday environments. Conventionally, such an instruction is disambiguated through a dialogue system, but at the cost of time and cumbersome interaction. Instead, we propose a multimodal language understanding (MLU) approach in which instructions are disambiguated from the robot state and the environment context. Deep learning techniques are used to predict task feasibility according to the HSR's physical limitations and the space available in the different target areas. We developed the MultiModal Classifier Generative Adversarial Network (MMC-GAN) to predict the likelihood of each target area given the robot's physical limitations and the clutter of the target. The target areas are then ranked by task feasibility, as illustrated in the left-hand figure below for the case where the robot should place an object in a given environment. These MLU methods will be extended to map objects to the different target areas of the environment. This mapping should characterize the likelihood of an object being on a given piece of furniture or in a given area.
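The ranking step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `rank_target_areas`, the area names, and the probability values are hypothetical stand-ins for the output of a trained classifier, not values produced by MMC-GAN.

```python
from typing import Dict, List, Tuple


def rank_target_areas(feasibility: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate target areas from most to least feasible."""
    for area, p in feasibility.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {area!r} must be in [0, 1], got {p}")
    # Descending probability; ties are broken alphabetically for determinism.
    return sorted(feasibility.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical classifier outputs for "Put away the milk and cereal."
scores = {"dining table": 0.35, "kitchen shelf": 0.82, "cluttered counter": 0.08}
ranking = rank_target_areas(scores)
best_area, best_score = ranking[0]
```

With such a ranking, the robot can try the most feasible area first and fall back to the next candidate instead of asking the user a clarifying question.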

Similarly, the vision-and-language navigation task can be addressed. For this, we developed a causal (GPT-style) Transformer that predicts the sequence of actions needed to reach a target destination from instructions such as: "Go down stairs. At the bottom of the stairs walk through the living room and to the right into the bathroom. Stop at the sink."
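Two ingredients of a causal, autoregressive action predictor can be sketched without any deep learning library: the causal mask, which lets each position attend only to itself and earlier positions, and a greedy decoding loop that conditions each action on the actions already taken. The action set `ACTIONS` and the scoring function `toy_score_fn` below are hypothetical placeholders for a trained model; the sketch shows the control flow, not the actual network.

```python
from typing import Callable, List, Sequence

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]  # hypothetical action set


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def greedy_decode(
    score_fn: Callable[[str, Sequence[str]], List[float]],
    instruction: str,
    max_steps: int = 10,
) -> List[str]:
    """Pick the highest-scoring action at each step, conditioning on past actions."""
    history: List[str] = []
    for _ in range(max_steps):
        scores = score_fn(instruction, history)
        best = max(range(len(ACTIONS)), key=scores.__getitem__)
        history.append(ACTIONS[best])
        if ACTIONS[best] == "stop":
            break
    return history


def toy_score_fn(instruction: str, history: Sequence[str]) -> List[float]:
    """Stand-in for a trained model: forward twice, turn right, then stop."""
    plan = ["forward", "forward", "turn_right", "stop"]
    target = plan[min(len(history), len(plan) - 1)]
    return [1.0 if action == target else 0.0 for action in ACTIONS]


path = greedy_decode(toy_score_fn, "Stop at the sink.")
mask = causal_mask(3)
```

In a real model the scores would come from the Transformer's output layer, and the mask would be applied inside self-attention so that training on full trajectories never leaks future actions.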

Three different tasks solved with multimodal language understanding. Left: Predicting a suitable destination from a placing instruction. Middle: Predicting the target object from fetching instructions. Right: Navigation in indoor environments from natural language instructions.

Data Augmentation for Multimodal Language Understanding

One of the central issues in MLU for DSRs is the lack of large-scale datasets, although more datasets are now being released thanks to the strong interest of the GAFA companies (Google, Apple, Facebook, Amazon) in these tasks.

We proposed several approaches to address this challenge. The first performs data augmentation in the latent space. Building on Generative Adversarial Networks (GANs), we introduced simultaneous data augmentation and classification to multimodal language understanding for manipulation tasks. A generator network is trained to mimic the distribution of the actual latent-space features, and classification is then performed on both the actual and the generated latent features. This work showed that augmenting data in the latent space can be more efficient than augmenting it in the raw input space.
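The data flow of latent-space augmentation can be illustrated with a deliberately simplified stand-in. In the sketch below, a per-class Gaussian sampler replaces the trained GAN generator; this is a substitution made purely for illustration, not the method described above. The function names and the toy latent features are assumptions. What carries over is the idea that the classifier's training set mixes actual and generated latent features.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Tuple

Sample = Tuple[List[float], int]  # (latent feature vector, class label)


def fit_class_gaussians(data: List[Sample]) -> Dict[int, List[Tuple[float, float]]]:
    """Per class, estimate (mean, std) of every latent dimension."""
    by_label: Dict[int, List[List[float]]] = {}
    for feature, label in data:
        by_label.setdefault(label, []).append(feature)
    return {
        label: [(mean(dim), stdev(dim)) for dim in zip(*features)]
        for label, features in by_label.items()
    }


def augment(data: List[Sample], n_per_class: int, rng: random.Random) -> List[Sample]:
    """Return the actual samples followed by synthetic latent features."""
    stats = fit_class_gaussians(data)
    synthetic = [
        ([rng.gauss(mu, sigma) for mu, sigma in dims], label)
        for label, dims in stats.items()
        for _ in range(n_per_class)
    ]
    return data + synthetic


# Toy 2-D latent features for two classes (hypothetical values).
real = [([0.1, 0.9], 0), ([0.2, 0.8], 0), ([0.9, 0.1], 1), ([0.8, 0.3], 1)]
train_set = augment(real, n_per_class=3, rng=random.Random(0))
```

A classifier would then be trained on `train_set`. In the GAN setting, the sampler is replaced by a generator trained adversarially against the classifier or discriminator rather than fitted in closed form.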

The second approach consisted in developing an automatic image annotation engine based on image captioning, as illustrated in the right-hand figure below. Such an engine is very useful for competitions such as RoboCup@Home or WRS, where it can automate the generation of fetching instructions. The approach generates a sentence describing a target object using an LSTM network.

Generating fetching instructions on the fly when collecting data

Interpretability of MLU Models

In the context of DSRs that support elderly or disabled people, safety is one of the most important considerations. For users to understand and validate MLU models beforehand, it is crucial that these models be interpretable. Inspired by the different attention models developed in computer vision, several of our works include an attention model to interpret classification or generation results.

Based on attention branch networks, a first work focused on predicting damaging collisions for DSRs when placing objects in a target destination. The proposed DNN model, PonNet, classifies and predicts beforehand the likelihood of a damaging collision. An attention map can also be extracted from PonNet to "explain" where the collision would likely occur, from RGB-D data only.
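How an attention map can both reweight features and point to a location can be sketched as follows. The residual weighting f' = (1 + M) * f is the form generally used in attention branch networks; whether PonNet applies it in exactly this way is an assumption here, and the grids below are toy values rather than real RGB-D features.

```python
from typing import List, Tuple

Grid = List[List[float]]


def apply_attention(features: Grid, attention: Grid) -> Grid:
    """Residual attention weighting, element-wise: f' = (1 + M) * f."""
    return [
        [(1.0 + m) * f for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(features, attention)
    ]


def peak_location(attention: Grid) -> Tuple[int, int]:
    """Return (row, col) of the highest attention value."""
    cells = ((r, c) for r, row in enumerate(attention) for c in range(len(row)))
    return max(cells, key=lambda rc: attention[rc[0]][rc[1]])


# Toy 3x3 attention map with values in [0, 1] and a constant feature map.
attention = [[0.0, 0.25, 0.0], [0.25, 1.0, 0.5], [0.0, 0.25, 0.0]]
features = [[2.0] * 3 for _ in range(3)]
weighted = apply_attention(features, attention)
hotspot = peak_location(attention)
```

Because of the `1 +` term, regions with zero attention keep their original features instead of being erased, while the peak of the map gives a human-readable cue about where the model is looking.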

With a similar approach, attention mechanisms for fetching tasks or sentence generation could be developed to improve and explain the results obtained.


Using visual attention mechanisms for prediction and text generation explainability. Left: Predicting collisions before placing objects, with visual attention focused on the likely collision. Middle: Sentence generation with linguistic attention. Right: Fetching instructed objects with visual attention.

Related Publications