Robot Audition


Towards the dream of autonomous human-robot interaction, a robot endowed with a reliable hearing sense is essential. Nowadays, such a hearing sense generally comprises auditory capabilities such as sound localization, sound separation, speech processing and auditory scene analysis, which together make up the field of robot audition. In this research area, particular attention has been paid to sound source localization, certainly because it is the first step of auditory-based interaction. The typical approach to sound localization consists in transposing knowledge from psychoacoustics and physiology into an artificial hearing system that extracts auditory cues. The settings generally assume that the sensors and the sources of interest are static. Auditory cues such as the interaural time difference, the interaural level difference or the spectral notches provide information about the direction of arrival of the sound in the horizontal and vertical planes.


The efficiency of the localization process depends exclusively on the accuracy of the extracted auditory cues. However, mimicking the human hearing system proves complex in realistic configurations. Perception, and more particularly auditory cues, are influenced by the shape of the body (head, pinna, torso...) along with the acoustic conditions (room acoustics, noise, reverberation). The sensitivity to auditory events varies with each individual and each location. As a consequence, even the most complex artificial auditory systems can be deployed in controlled environments only. Real-world configurations, which include dynamic scenes, reverberation and noise, drastically degrade the accuracy of the auditory cues and, in turn, the localization performance. As a result, sound source localization with a binaural setup was not achievable in real-world environments.

Scientific Contributions

My work addressed motion strategies linked to auditory perception, in a binaural context, with a central focus on the following question: do robots need to localize sound sources to engage in an interaction? To answer this question, my work defined auditory-based interactions as tasks modelled in a sensor-based framework. This framework, which I defined as aural servo (AS), does not localize sound sources. A task is characterized by a state of measurements to satisfy, which generally corresponds to a particular pose of the robot. For instance, in AS, a homing task consists in positioning the robot so as to satisfy given conditions defined by a set of auditory measurements, while a localization-based approach requires extracting the bearing angle and the distance to the sound source before moving the robot to the correct location. In the AS paradigm, the motion of the robot is generated in real time, through a control loop based on auditory cues measured on the fly. Velocity inputs of the robot are obtained by modeling the relationship between the variation of given auditory cues and the motion of the sensors. In this way, my approach principally relies on the variation of these cues instead of their intrinsic values. Consequently, the localization step is skipped and neither the adverse acoustic conditions nor the effect of the body/head need to be modeled. Hence, it is even possible to complete tasks with erroneous auditory cues, as long as the variation of these cues remains consistent.

Typical auditory cues

The main auditory cues are the interaural time difference (ITD) and the interaural level difference (ILD), which are respectively generated by the delay and the attenuation of a given sound wave between the two ears. Both ILDs and ITDs are related to the direction of arrival of the sound in the azimuth plane. ITDs and ILDs are complementary cues that are relevant in different frequency ranges. Indeed, ILDs are significant for frequencies above approximately 1.5 kHz, that is to say when the head is large compared to the wavelength of the incoming sound. Conversely, ITDs are meaningful at low frequencies.

The ILD is caused by the difference of sound pressure between the two ears, while the ITD reflects the phase difference between the signals perceived by these ears.
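As a concrete illustration of these two cues, the sketch below estimates an ITD from the lag of the cross-correlation peak between two channels, and an ILD from their energy ratio. It is a minimal example with a synthetic signal, not the actual extraction pipeline used in my work; the function name and the sign convention (positive ITD when the sound reaches the left microphone first) are assumptions of this sketch.

```python
import numpy as np

def estimate_itd_ild(left, right, fs):
    """Estimate the ITD (s) from the cross-correlation peak lag and the
    ILD (dB) from the energy ratio between the two channels."""
    # Cross-correlate the channels; the lag of the peak approximates the ITD.
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)   # positive: right channel lags
    itd = lag / fs
    # ILD as the log energy ratio between the two channels.
    ild = 10.0 * np.log10(np.sum(left**2) / np.sum(right**2))
    return itd, ild

# Synthetic test: the right channel is a delayed, attenuated copy of the left.
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(1600)
delay = 8                                # interaural delay in samples
right = 0.5 * np.roll(left, delay)       # delayed and attenuated copy
itd, ild = estimate_itd_ild(left, right, fs)
```

On this synthetic pair the estimated ITD is delay/fs = 0.5 ms and the ILD is 10 log10(4), about 6 dB, matching the imposed delay and attenuation.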

Basics of sensor-based control

Sensor-based control is a closed-loop control scheme that consists in a dynamic "sense and move" approach. The control input dot(q) of a given robot is steered by the dynamic feedback of sensor measurements s(r) and generates a motion until a pose characterized by s(r) = s(r_d) is reached. Formally, the goal of a sensor-based task consists in regulating the error between the observed and desired values, s(r) - s(r_d), to 0. The core principle of this regulation lies in controlling the robot through a C2-differentiable error function e(q, t), which has to be regulated over a time interval [0, T].

A sensor-based approach links the motion of the robot to the sensor measurement s(r) in a feedback loop, until the robot reaches a desired configuration characterized by the desired measurement s(r_d).

Based on the task function approach, the relation between the features s and the robot velocity dot(q) is given by the task Jacobian Js: dot(s) = Js dot(q).

Designing a task regulation mainly consists in i) selecting s, ii) computing or measuring s and iii) modeling the task Jacobian Js (via the interaction matrix Ls), while ensuring that the admissibility conditions of the task are not violated.
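The regulation described above is classically obtained with the control law dot(q) = -lambda Js+ (s - s_d), where Js+ is the pseudo-inverse of the task Jacobian. The sketch below simulates this loop on a toy linear system; the matrix, dimensions and gains are hypothetical numbers chosen only to show the exponential decay of the error.

```python
import numpy as np

def sensor_based_step(s, s_d, Js, lam=1.0):
    """Classical task-function control law: q_dot = -lambda * Js^+ (s - s_d)."""
    return -lam * np.linalg.pinv(Js) @ (s - s_d)

# Toy example (hypothetical numbers): a 2-D feature driven by a 3-DoF robot,
# with a linear measurement model s = Js q standing in for the real sensor.
Js = np.array([[1.0, 0.0,  0.2],
               [0.0, 1.0, -0.1]])
q = np.zeros(3)                      # robot configuration
s_d = np.array([0.3, -0.2])          # desired measurement
dt = 0.05
for _ in range(200):
    s = Js @ q                       # "sense"
    q = q + dt * sensor_based_step(s, s_d, Js)   # "move"
error = np.linalg.norm(Js @ q - s_d)
```

Since Js has full row rank here, the closed-loop error decays as (1 - lambda dt)^k, so after 200 iterations it is negligible.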

Aural Servo 

In the ideal case, owing to the inverse-square law inherent to spherical and uniform sound propagation, the ILD can be simplified as the squared ratio of the distances l1 and l2. This assumes that the perceived sound signal varies little between the two microphones. By differentiating this ratio, an approximate interaction matrix can be computed.
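Under this spherical-propagation assumption, the ratio and its time derivative can be sketched as follows (notation assumed here: rho denotes the ILD, l1 and l2 the source-to-microphone distances):

```latex
\rho = \left(\frac{l_2}{l_1}\right)^2,
\qquad
\dot{\rho} = \frac{2\,l_2\,\dot{l}_2\,l_1^2 - 2\,l_2^2\,l_1\,\dot{l}_1}{l_1^4}
           = 2\rho\left(\frac{\dot{l}_2}{l_2} - \frac{\dot{l}_1}{l_1}\right)
```

Expressing dot(l1) and dot(l2) as functions of the microphone velocities then yields the approximate interaction matrix.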

The ITD tau can be computed from the direction of arrival (DoA) alpha as tau = A cos(alpha), with A = d/c, where d is the distance between the microphones and c the speed of sound. The interaction matrix related to the ITD tau then follows by differentiating this expression.
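The sketch below illustrates the AS principle on this ITD model: a head rotates to face a source by regulating tau to 0, using only the constant approximation L ≈ A of the interaction matrix (its value at alpha = 90 degrees), so neither alpha nor the source position is ever estimated. The geometry, gain and loop structure are simplified assumptions for illustration, not the actual controller of my work.

```python
import math

# Hypothetical geometry: microphone spacing d (m) and speed of sound c (m/s).
d, c = 0.18, 343.0
A = d / c                             # maximal ITD (s)

def itd(alpha):
    """ITD model tau = A cos(alpha), alpha being the DoA w.r.t. the head."""
    return A * math.cos(alpha)

# Aural-servo-style loop: rotate the head so that tau -> 0 (source in front),
# using the constant approximate interaction matrix A in the control law.
alpha_src = math.radians(150.0)       # true source DoA, unknown to the controller
theta = 0.0                           # head orientation
lam, dt = 2.0, 0.01
for _ in range(1000):
    tau = itd(alpha_src - theta)      # measured cue; only its variation matters
    theta += dt * (-lam * tau / A)    # q_dot = -lambda * L^+ * (tau - 0)
final_alpha = math.degrees(alpha_src - theta)
```

Even though the true interaction term is A sin(alpha) rather than A, the loop converges to alpha = 90 degrees (tau = 0), which is precisely why a consistent variation of the cue suffices and the exact model is not needed.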

Setup with two microphones and a sound source


Basic applications of Aural Servo. 


Advanced application of Aural Servo with several features

Aural Servo on Humanoid Robots

Binaural localization on humanoid robots is the configuration closest to biological auditory systems, but at the same time probably the most challenging one for robot audition. As with human listeners, the body of a humanoid robot can shadow, reflect or diffract sounds, an effect described by the head-related transfer function (HRTF). Modeling HRTFs is a major concern in robot audition, since the relationship between the sound direction and the interaural cues is modified in a nonlinear way. The induced scattering and spectral modifications particularly affect the localization performance. Indeed, the filtering effect of the robot structure is reflected in the sound level attenuation and delays. Typical studies on humanoid robots rely on the knowledge of HRTFs to determine the azimuth and the elevation of the sound source, in controlled conditions. Moreover, each robot has a unique HRTF that cannot be transferred to other platforms.

Instead, my work proved that knowing the HRTF of each robot is not necessary. The core advantage of AS is certainly that the approach mainly relies on the variation of auditory features. As long as the variation of the selected features is consistent with the robot motion, the desired task can be correctly achieved. As shown in the videos below, it is thus possible to control the gaze of robots through aural servo without any knowledge of their HRTFs.


Romeo gaze control with Aural Servo


Head-turn reflex in HRI with Aural Servo (Sound On)

Related Publications