The DIRHA English-PHdev Corpus,
baselines and tools for multi-room distant speech recognition


The Scenario
Distant speech recognition in real-world environments is still a challenging problem: reverberation and dynamic background noise are major sources of acoustic mismatch that severely degrade automatic speech recognition (ASR) performance, which can otherwise be very good in close-talking microphone setups. In this context, a particularly interesting topic is the adoption of distributed microphones for the development of voice-enabled automated home environments based on distant-speech interaction. In such a scenario, microphones are installed in different rooms, and the resulting multi-channel audio recordings capture multiple audio events, including voice commands or spontaneous speech, generated in various locations and characterized by a variable amount of reverberation as well as possible background noise.

To facilitate the evaluation of the required algorithms, the DIRHA consortium provides a set of multichannel recordings made in a domestic environment, based on the newly created DIRHA corpus.

The Dataset
The phonetically-rich part of the DIRHA English Dataset [1,2] is a multi-microphone acoustic corpus being developed under the EC project Distant-speech Interaction for Robust Home Applications (https://dirha.fbk.eu). The corpus is composed of real phonetically-rich sentences recorded with 32 sample-synchronized microphones in a domestic environment.

The database contains signals with different reverberation characteristics, making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The currently released part of the dataset comprises 4 native US speakers (2 males, 2 females) uttering 192 sentences from the Harvard Corpus, simultaneously recorded with 32 microphones. All the sentences have been automatically annotated at the phone level and manually checked by an expert.
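Phone-level annotations of this kind are typically stored in TIMIT-style .phn files, where each line gives the start sample, end sample, and phone label of one segment. As a minimal sketch (assuming the released .phn files follow that TIMIT convention; the function name `read_phn` is illustrative, not part of the DIRHA tools), such a file can be parsed as follows:

```python
# Sketch: parse TIMIT-style phone-level annotation lines.
# Assumed format (verify against the released .phn files):
#   "<start_sample> <end_sample> <phone>"

def read_phn(lines):
    """Return a list of (start_sample, end_sample, phone) tuples."""
    segments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, end, phone = line.split()
        segments.append((int(start), int(end), phone))
    return segments

# Toy example with made-up sample indices:
example = ["0 3050 h#", "3050 4559 sh", "4559 5723 iy"]
print(read_phn(example))
```

In practice the lines would come from one of the released annotation files, e.g. via `open(path).readlines()`.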

The database is organized as follows:

  • Data/DIRHA_English_phrich: this folder contains the released dataset with the 192 sentences recorded with 32 microphones. For each file, a phone-level annotation (.phn) is provided
  • Data/Training_IRs: impulse responses used for contaminating the original TIMIT dataset
  • Data/TIMIT_noise_sequences: noise sequences used for contaminating the original TIMIT dataset
  • Additional Info: contains information about channels and speakers.
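The contamination of clean TIMIT data mentioned above amounts to convolving each clean signal with a measured impulse response and adding a noise sequence at a chosen SNR. A minimal sketch of this idea (the function `contaminate` and all variable names are illustrative assumptions, not the actual DIRHA/TIMIT contamination scripts, which are MATLAB tools in the repository):

```python
import numpy as np

def contaminate(clean, ir, noise, snr_db):
    """Convolve clean speech with an impulse response, then add noise at snr_db."""
    # Reverberant signal: linear convolution, truncated to the clean length.
    reverberant = np.convolve(clean, ir)[: len(clean)]
    noise = noise[: len(reverberant)]
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(sig_pow / (scale**2 * noise_pow)) == snr_db.
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + scale * noise

# Toy data standing in for a TIMIT sentence, a measured IR, and a noise sequence:
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)              # 1 s at 16 kHz (placeholder)
ir = np.zeros(2048); ir[0] = 1.0; ir[500] = 0.5  # toy two-tap "room" response
noise = rng.standard_normal(16000)
noisy = contaminate(clean, ir, noise, snr_db=10)
print(noisy.shape)
```

With real data, `clean` and `noise` would be waveforms read from the TIMIT and Data/TIMIT_noise_sequences files, and `ir` one of the responses in Data/Training_IRs.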

The distributed data are owned by the DIRHA partners and can be used for research purposes only.

Additional Tools
In order to facilitate the use of the dataset for distant speech recognition purposes, users can find additional resources (such as tools and Kaldi baselines) on GitHub:

https://github.com/SHINE-FBK/DIRHA_English_phrich

This allows us to keep our baselines and tools updated. More specifically, the repository currently contains MATLAB scripts and Kaldi baselines for building a phone-based distant speech recognizer.

Download
The download of the phonetically-rich part of the DIRHA English Dataset is free for research purposes only. Please cite the papers below if you use the dataset.
To receive the download link, please fill in the license agreement.

References
[1] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments", in Proceedings of ASRU 2015.

[2] M. Ravanelli, P. Svaizer, M. Omologo, "Realistic Multi-Microphone Data Simulation for Distant Speech Recognition", in Proceedings of Interspeech 2016.
 
Contacts
Maurizio Omologo ( )
Mirco Ravanelli ( )
Luca Cristoforetti ( )