The DIRHA simulated corpus

A multi-microphone, multi-language simulated corpus (referred to below as the DIRHA SimCorpus) is being developed under the EC project Distant-speech Interaction for Robust Home Applications (DIRHA; see http://dirha.fbk.eu for details). An excerpt containing 6 multi-microphone simulations for the Italian language is available for download (see the link at the end of this page).
Further information about the download of a larger set of sequences will be made available on the project web site, in connection with an initiative related to a special session of the HSCMA 2014 workshop (see http://hscma2014.inria.fr).

The simulated corpus

The DIRHA SimCorpus (Cristoforetti, 2014) is a multi-microphone, multi-language database containing simulated acoustic sequences derived from a microphone-equipped apartment (referred to as the ITEA apartment) available under the DIRHA project.
For each language, the corpus contains a set of 60-second acoustic scenes, sampled at 48 kHz with 16-bit accuracy and observed by 40 microphones distributed over 5 different rooms, as shown in Figure 1.

   1.  Multi-Microphone set-up
The overall multi-microphone setup is based on both distributed microphone networks (pairs or triplets of sensors on the walls) and more compact microphone arrays (on the ceiling of the living-room and the kitchen).
In each pair, the sensors are 30 cm apart, while the microphones in each triplet are 15 cm apart. Each array is composed of six microphones: five are placed on a circumference of radius 30 cm, and one sensor is located at the center of the circle.
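The ceiling-array geometry described above can be computed directly. A minimal sketch in Python (coordinates in metres, relative to the array centre):

```python
import math

def circular_array_positions(radius=0.30, n_outer=5):
    """Return (x, y) offsets for a ceiling array: n_outer microphones
    evenly spaced on a circle of the given radius, plus one sensor
    at the centre of the circle."""
    positions = [(0.0, 0.0)]  # centre sensor
    for k in range(n_outer):
        angle = 2 * math.pi * k / n_outer
        positions.append((radius * math.cos(angle),
                          radius * math.sin(angle)))
    return positions

mics = circular_array_positions()
print(len(mics))  # 6 microphones per array
```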

   2.  Simulation composition
Each 1-minute simulation includes the following acoustic events occurring in different possible time-instants, rooms and positions:

  • A keyword followed by a command;
  • A spontaneous command (without the keyword);
  • A phonetically rich sentence;
  • A segment of conversational speech;
  • A variable number of localized non-speech sources (e.g., radio, TV, appliances, knocking, ringing, creaking and many others).

Multi-microphone background noise recorded in the real environment has also been added to each simulation.
Each simulated acoustic sequence has been replicated in four languages (Table 1) while preserving the same background noise and non-speech sources. Gender and timing of the active speakers have been preserved across the different languages, in order to ensure homogeneity.
The first release, described in Table 1, includes at least two data-sets of 75 sequences for each language. Each set (dev1, test1, test2) is composed of a selection of 10 different speakers.
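The scene composition described above can be illustrated with a toy event-placement routine. This is only a sketch under a simplifying assumption (random, non-overlapping placement of events on a 60-second timeline); the actual DIRHA generation tool also handles rooms, positions and overlap rules:

```python
import random

def place_events(durations, scene_len=60.0, seed=0):
    """Assign a random, non-overlapping start time (in seconds) to
    each event of the given duration within a scene of scene_len
    seconds. Illustrative only, not the actual DIRHA logic."""
    rng = random.Random(seed)
    events, cursor = [], 0.0
    slack = scene_len - sum(durations)   # silence left to distribute
    for d in durations:
        gap = rng.uniform(0, slack / len(durations))
        start = cursor + gap             # random pause before the event
        events.append((start, start + d))
        cursor = start + d
    return events

# keyword+command, spontaneous command, phonetically rich sentence,
# conversational segment, one localized non-speech source
timeline = place_events([4.0, 3.0, 5.0, 6.0, 2.0])
```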

 

Figure 1: The ITEA recording apartment. Black dots represent microphones; coloured boxes represent loudspeaker positions and orientations.

Language      Dev1   Test1   Test2   Total
ITALIAN        75     75      75      225
GERMAN         -      75      75      150
GREEK          -      75      75      150
PORTUGUESE     -      75      75      150

Table 1: Composition of the DIRHA SimCorpus. Each simulated acoustic scene lasts 60 seconds and has been composed of several speech and non-speech sources distributed over five different rooms.

   3.  Contamination process
The whole simulation process, performed by a MATLAB tool developed at FBK, is depicted in Figure 2.

Figure 2: Basic scheme of the simulation process (Matassoni, 2002). A set of dry acoustic sources (speech or typical home noise sequences) is selected from the available clean corpus. For each source, a random position in space is chosen, and the clean signal is convolved with the corresponding set of multi-microphone IRs to account for the room acoustics. To increase the realism of the acoustic scene, real background-noise sequences of different dynamics are also added.
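The contamination step in the caption above can be sketched as follows. The FBK MATLAB tool itself is not reproduced here, so the function name, the SNR-based noise scaling and the synthetic test signals are illustrative assumptions:

```python
import numpy as np

def contaminate(dry, irs, noise, snr_db=10.0):
    """Simulate the multi-microphone observation of a dry source:
    convolve the clean signal with one impulse response per
    microphone, then add background noise scaled to the requested
    SNR. Sketch only; the actual tool's scaling may differ."""
    channels = []
    for ir in irs:                          # one IR per microphone
        wet = np.convolve(dry, ir)[:len(noise)]
        sig_pow = np.mean(wet ** 2)
        noise_pow = np.mean(noise ** 2)
        gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
        channels.append(wet + gain * noise)
    return np.stack(channels)

fs = 48000
dry = np.random.randn(fs)                   # 1 s of synthetic "speech"
irs = [np.random.randn(2048) * np.exp(-np.arange(2048) / 500)
       for _ in range(3)]                   # 3 decaying synthetic IRs
noise = np.random.randn(fs)                 # synthetic background noise
obs = contaminate(dry, irs, noise)
print(obs.shape)  # (3, 48000)
```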

The IR measurement process explored several different positions and orientations of the loudspeaker over the various rooms of the apartment, in order to generate simulated data with a satisfactory level of richness in terms of spatial variability. As shown in Table 2, more than 9000 sample-synchronized IRs have been measured. Cross-room impulse responses are also included, in order to simulate sources located in other rooms.

Room         Installed microphones   Available positions   Measured IRs   T60 (s)
LIVINGROOM           15                      18                2960         0.74
KITCHEN              13                      18                2960         0.83
BEDROOM               7                      14                2160         0.68
BATHROOM              3                       4                 640         0.75
CORRIDOR              2                       3                 480         0.60

Table 2: Measured Impulse Responses
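As a quick sanity check, the per-room counts in Table 2 can be summed to verify the "more than 9000" figure quoted above:

```python
# Per-room measured-IR counts, copied from Table 2.
measured_irs = {"LIVINGROOM": 2960, "KITCHEN": 2960,
                "BEDROOM": 2160, "BATHROOM": 640, "CORRIDOR": 480}
total = sum(measured_irs.values())
print(total)  # 9200, consistent with "more than 9000"
```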

The IR measurements were based on a professional studio monitor (Genelec 8030A) used to excite the target environment with long Exponential Sine Sweep (ESS) signals (Farina, 2000). As pointed out in (Ravanelli, 2012), the ESS method ensures IR measurements with a high SNR and remarkable robustness against harmonic distortions. The Time of Flight (TOF) information, crucial for applications such as acoustic event localization, beamforming, multi-microphone signal processing and speech enhancement, has been preserved by means of six sample-synchronized multi-channel audio cards (RME Octamic II). The measured IRs are sampled at 48 kHz with 24-bit accuracy.
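The ESS excitation and the corresponding inverse filter of the Farina method can be sketched as follows. The sweep parameters (20 Hz to 20 kHz, 10 s) are illustrative assumptions, not the values actually used in the DIRHA measurements:

```python
import numpy as np

def ess(f1=20.0, f2=20000.0, duration=10.0, fs=48000):
    """Exponential sine sweep after Farina (2000): the instantaneous
    frequency rises exponentially from f1 to f2 over the sweep."""
    t = np.arange(int(duration * fs)) / fs
    R = np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * duration / R *
                  (np.exp(t * R / duration) - 1))

def inverse_filter(sweep, f1=20.0, f2=20000.0, fs=48000):
    """Time-reversed sweep with an exponentially decaying amplitude
    envelope; convolving the room recording with this filter
    yields the impulse response."""
    T = len(sweep) / fs
    t = np.arange(len(sweep)) / fs
    R = np.log(f2 / f1)
    return sweep[::-1] * np.exp(-t * R / T)

sweep = ess()
inv = inverse_filter(sweep)
```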

   4.  Annotations
The signal of each microphone is accompanied by an XML-like annotation file which describes the content of the acoustic sequence in detail. It reports the exact timing of each acoustic event observed at that microphone, its type, the coordinates of the source, and an estimate of the SNR and of the reverberation time. Speech events are also transcribed, with timing at the word level. Additional information indicates the intervals in which multiple events overlap and the intervals of effective activity of each source.
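A minimal sketch of how such an annotation could be parsed. The tag and attribute names below are hypothetical, since the exact schema of the DIRHA annotation files is not reproduced on this page:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation excerpt: the real DIRHA files are
# "XML-like", but their actual tag names may differ.
sample = """
<mic name="LA6">
  <event type="speech" begin="3.20" end="6.85"
         x="2.1" y="3.4" z="1.6" snr="8.5"/>
  <event type="radio" begin="0.00" end="60.00"
         x="4.0" y="1.2" z="0.9" snr="2.1"/>
</mic>
"""

root = ET.fromstring(sample)
# Collect the timing of all speech events observed at this microphone.
speech = [(float(e.get("begin")), float(e.get("end")))
          for e in root.iter("event") if e.get("type") == "speech"]
print(speech)  # [(3.2, 6.85)]
```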
        

   5.  Possible applications
First tests demonstrated the usefulness of the corpus for experiments on source localization, acoustic event detection, echo cancellation and speech/non-speech discrimination. Experimental activity is currently under way to establish baseline systems that could be used in the future for benchmarking and international challenges.

For more information please contact: mravanelli@fbk.eu, omologo@fbk.eu.

 

Bibliography

M. Matassoni, M. Omologo, D. Giuliani, P. Svaizer, "HMM Training with Contaminated Speech Material for Distant-Talking Speech Recognition", in Computer Speech and Language, 2002.

A. Farina, “Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique”, 110th AES Convention, February 2000.

M. Ravanelli, A. Sosi, P. Svaizer, M. Omologo, “Impulse response estimation for robust speech recognition in a reverberant environment”, EUSIPCO 2012.

L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmueller, P. Maragos,  “The DIRHA simulated corpus”, LREC 2014.

 

Download link

A single 60-second example simulation can be downloaded from this link (size: 5.5 MB).

The data can be downloaded from this link; the file is 1 GB in size and contains six simulations, as well as the related documentation.