HSCMA 2014 Special Session -

Speech detection and speaker localization
in domestic environments


The DIRHA project investigates the adoption of distributed microphone networks and related processing for the development of voice-enabled automated home environments based on distant-speech interaction.
Its main feature is that the microphones are installed in different rooms, which are acoustically coupled with one another since all the doors are open.
The final goal is to process the signals of the resulting "microphone network" as effectively as possible, in order to analyze and interpret a multi-room acoustic scene and, in the case of a speech event, recognize the sentence uttered by the speaker.
The scenarios currently investigated encompass typical situations observable in domestic contexts, in terms of speech input as well as of other acoustic events and background noise.
Since all the microphones are far from the sound sources, most of the observed events are also affected by significant reverberation, which makes the addressed tasks very challenging.
During the last year, the DIRHA consortium created both simulated data sets and real data sets to train and test a variety of signal processing algorithms. To understand the typical scenes we are trying to process automatically, some audio examples are available at the following links:


real data example

simulated data example


Note that the real data were extracted from Wizard-of-Oz (WOZ) sessions in Italian. Each session consisted of a real interaction between a user and the Wizard, with the latter reproducing its output through a loudspeaker installed on the ceiling of the room.

We invite researchers working in the field of multi-microphone signal processing to develop and test their techniques on the DIRHA corpora. This is a unique opportunity to access the DIRHA data sets and to assess your algorithms in a real-world scenario.


The results of your experimental activities will be presented at this special session of HSCMA 2014.
The researchers who provide the best results will be invited to give a talk in Lisbon at a forthcoming satellite workshop of EUSIPCO, also devoted to the dissemination of the DIRHA project.
They will also be invited to contribute to a possible special issue of a journal, or to a Springer book, which will follow that satellite workshop.
We will also refer to the best experimental results in the official deliverables of the DIRHA project, when describing baseline systems realized by labs outside the consortium.
We are considering other possible initiatives as well, which, if confirmed, will be announced in the coming weeks.


The task is a combination of speech/non-speech detection and speaker localization.
Hence, for each detected speech event the goal is to:

  • provide the corresponding time boundaries,
  • determine the room where it was generated,
  • derive the spatial coordinates of the speaker.

Any other acoustic event must be ignored.
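For illustration purposes only, a system hypothesis for one scene could be represented by a simple data structure along these lines (a minimal sketch; the field names and the output format actually required by the evaluation software are our own assumptions, not the official DIRHA format):

```python
from dataclasses import dataclass

@dataclass
class SpeechEvent:
    """One detected speech event (hypothetical representation)."""
    t_start: float   # event onset, in seconds from the scene start
    t_end: float     # event offset, in seconds
    room: str        # room where the event was generated
    x: float         # speaker x-coordinate, in metres
    y: float         # speaker y-coordinate, in metres

# A scene description is then simply a chronological list of events;
# any non-speech acoustic event is omitted from this list.
hypothesis = [
    SpeechEvent(t_start=3.20, t_end=5.85, room="livingroom", x=2.1, y=3.4),
    SpeechEvent(t_start=21.10, t_end=24.02, room="kitchen", x=1.0, y=0.8),
]
```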

In the case of simulated data, the speaker is stationary while pronouncing a sentence. In the case of real data, the speaker generally changes her/his position in space, although the movements are often small fluctuations around a given location.
For both contexts, the multi-microphone recordings are consistent with the layout of the real apartment used as reference under the DIRHA project, and with the related distribution of microphones in space (see the next figure).



More details about the rooms (e.g., size, estimated reverberation time, photos, etc.) and about the microphones (e.g., type and characteristics, coordinates, etc.) are available in the documents that will be delivered with the development and test data sets.
The data sets include a number of scenes (all of 1-minute length in the case of simulated data, and of variable length in the case of real data).
For each scene, a 48 kHz/16-bit signal is available for each microphone in the following rooms: living-room, kitchen, corridor, bedroom, bathroom. In other words, each scene is described by a set of 40 signals. In the case of real scenes, an additional reference signal (i.e., sys.wav) is also provided, corresponding to the output reproduced by the Wizard through a loudspeaker installed on the ceiling of the living-room or of the kitchen.
More details on the simulated corpus can be found on this page. In particular, the development data set includes 10 scenes for each of the following four languages: Austrian-German, Greek, Italian, Portuguese.

The goal is to develop a system that automatically derives a scene description “best matching” the ground-truth description, in terms of speech/non-speech segment boundaries and of the related speaker positions. For this purpose, your system can process all, or any subset, of the available microphone signals; there is no restriction in this respect. Your system can be trained using the development data set as well as any other prior information available in the related documents.
It is also worth noting that the system only has to detect and localize speakers in the living-room and in the kitchen. In other words, if an event is detected in one of the other three rooms (i.e., corridor, bathroom, and bedroom), it just has to be classified as interference (and consequently no speaker coordinates are required).
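Although participants are free to use any technique, a classical building block for the localization part is time-difference-of-arrival (TDOA) estimation over microphone pairs with GCC-PHAT, which is robust to reverberation thanks to its phase-transform weighting. A minimal sketch (not part of any DIRHA baseline) could look like:

```python
import numpy as np

def gcc_phat(x, y, fs=48000, max_tau=None):
    """Estimate the TDOA between two microphone signals with GCC-PHAT.
    Returns the delay in seconds; positive if x lags behind y."""
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:                  # bound the search by mic spacing
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs
```

TDOAs from several microphone pairs of known positions can then be combined (e.g., by least-squares intersection of the corresponding hyperbolae) to derive the speaker coordinates.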


To work on this task, you will initially have access to the development data set. Please let us know (see contacts below) if you are interested in working on the task, and we will send you instructions on how to access that data set. Access is restricted and requires the password you will receive.

We also kindly invite you to let us know about your interest in this action, after checking the real and simulated example signals and/or the small data set available on this page.
Ground-truth annotations and evaluation software will also be made available with the development material, as described in this document.
The adopted evaluation criteria and metrics are derived from the ones used in the European project CHIL.
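The exact criteria are specified in the distributed material; purely as an illustration of the flavour of CHIL-style localization scoring, an RMS Euclidean error between paired estimated and reference speaker positions can be computed as follows (the frame pairing and the restriction to 2-D coordinates are our own assumptions):

```python
import numpy as np

def rms_localization_error(est, ref):
    """RMS Euclidean distance (in metres) between estimated and
    reference speaker positions, paired frame by frame.
    est, ref: arrays of shape (n_frames, 2) with (x, y) coordinates."""
    est = np.asarray(est, dtype=float)
    ref = np.asarray(ref, dtype=float)
    d = np.linalg.norm(est - ref, axis=1)    # per-frame Euclidean error
    return float(np.sqrt(np.mean(d ** 2)))
```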

Three weeks before the HSCMA paper-submission deadline, you will be given access to the test data sets in order to run experiments and evaluate your results. For this purpose, you can use the above-mentioned evaluation software as well as the ground-truth annotations that will be made available.
Then, you can submit your results in one of the following ways:
1) a 5-page paper (regular format) and/or a 2-page paper (demo format), submitted by the HSCMA 2014 deadline (i.e., January 24, 2014), which will be reviewed and, if accepted, included in the proceedings;
2) a 2-page late-breaking paper, submitted by April 4, 2014, which will give you the opportunity to present a poster, but which will not be included in the proceedings.

All submissions shall be made via the HSCMA website.

Over the next few weeks, this web page will be enriched with more instructions and details. Moreover, we will report on some baseline results we are deriving using the development data set, which can be useful for reference purposes.

In the meantime, please feel free to contact us (Maurizio Omologo, at , and Alessio Brutti, at ) for any information you may need on access to the data sets and on how to run your experiments.



The data sets distributed within this action are owned by the DIRHA partners that created them, and can be used for research purposes only.

More details on ownership and copyright issues are included in the distributed material.