The DIRHA English-PHdev Corpus,
baselines and tools for multi-room distant speech recognition

The Scenario
Distant speech recognition in real-world environments remains a challenging problem: reverberation and dynamic background noise are major sources of acoustic mismatch that severely degrade automatic speech recognition (ASR) performance, which can instead be very good in close-talking microphone setups. In this context, a particularly interesting topic is the adoption of distributed microphones for the development of voice-enabled automated home environments based on distant-speech interaction. In such a scenario, microphones are installed in different rooms, and the resulting multi-channel audio recordings capture multiple audio events, including voice commands or spontaneous speech, generated in various locations and characterized by a variable amount of reverberation as well as possible background noise.

To facilitate the evaluation of the required algorithms, the DIRHA consortium provides a set of multi-channel recordings made in a domestic environment, based on the newly created DIRHA corpus.

The Dataset
The phonetically-rich part of the DIRHA English Dataset [1,2] is a multi-microphone acoustic corpus being developed under the EC project Distant-speech Interaction for Robust Home Applications (DIRHA). The corpus is composed of real phonetically-rich sentences recorded with 32 sample-synchronized microphones in a domestic environment.

The database contains signals with different reverberation characteristics, making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The part of the dataset currently released comprises 4 native US speakers (2 males, 2 females) uttering 192 sentences from the Harvard Corpus, simultaneously recorded with 32 microphones. All the sentences have been automatically annotated at the phone level and manually checked by an expert.
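The phone-level annotations can be loaded with a few lines of code. The sketch below assumes the .phn files follow the TIMIT-style convention of one "start_sample end_sample phone" triple per line; check the Additional Info folder for the exact format used in this release.

```python
# Minimal sketch of parsing a phone-level .phn annotation file.
# Assumption: TIMIT-style lines of the form "start_sample end_sample phone";
# the function name read_phn is illustrative, not part of the released tools.

def read_phn(path):
    """Return a list of (start_sample, end_sample, phone) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            start, end, phone = line.split()
            segments.append((int(start), int(end), phone))
    return segments
```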

The database is organized as follows:

  • Data/DIRHA_English_phrich: this folder contains the released dataset with the 192 sentences recorded by the 32 microphones. For each file, a phone-level annotation (.phn) is provided
  • Data/Training_IRs: impulse responses used for contaminating the original TIMIT dataset
  • Data/TIMIT_noise_sequences: noise sequences used for contaminating the original TIMIT dataset
  • Additional Info: contains information about the channels and speakers.
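The impulse responses and noise sequences above support the standard data-contamination recipe described in [2]: convolve a clean (e.g. TIMIT) signal with a measured room impulse response, then add a recorded noise sequence scaled to a target SNR. A minimal sketch, assuming all signals share the same sample rate; function and argument names are illustrative, not part of the released tools:

```python
import numpy as np

def contaminate(clean, ir, noise, snr_db):
    """Sketch of data contamination: reverberate `clean` with the room
    impulse response `ir`, then add `noise` scaled to `snr_db` dB SNR."""
    # Reverberant speech: linear convolution with the room IR,
    # truncated to the original signal length
    reverb = np.convolve(clean, ir)[: len(clean)]
    noise = noise[: len(reverb)]
    # Gain such that signal power / (gain^2 * noise power) = 10^(snr_db/10)
    sig_pow = np.mean(reverb ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverb + gain * noise
```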

Additional Tools
In order to facilitate the use of the dataset for distant speech recognition purposes, users can find additional resources (such as tools and Kaldi baselines) on GitHub:

This allows us to keep our baselines and tools updated. More specifically, the current repository contains MATLAB scripts and Kaldi baselines for building a phone-based distant speech recognizer.

The download of the phonetically-rich part of the DIRHA English Dataset is free for research purposes only. Please cite the papers below if you use the dataset.
To receive the download link, please fill in the license agreement.
Please fill in all the fields with information that clearly demonstrates you are going to use this dataset for research purposes in fields such as multi-microphone signal processing and distant-speech recognition. The dataset was built and annotated for that purpose.

We do not release it for private initiatives or commercial purposes, as agreed with the European Commission. For requests that do not meet these requirements, the FBK Contract Office will be involved in the process of defining an agreement with you before eventually giving you access to the dataset.

[1] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments", in Proceedings of ASRU 2015.

[2] M. Ravanelli, P. Svaizer, M. Omologo, "Realistic Multi-Microphone Data Simulation for Distant Speech Recognition", in Proceedings of Interspeech 2016.
Maurizio Omologo ( )
Mirco Ravanelli ( )
Luca Cristoforetti ( )