Corpora creation

In an earlier blog post, iTalk2Learn partner, Sail, talked about the statistical nature of the two models which are employed for automatic speech recognition:

  1. The acoustic model
  2. The language model

These models need to be trained on a so-called ‘corpus’ before they can be used in the recognition process.

In this blog post, we’ll hear more from Sail about the nature and type of ‘corpora’ (plural of ‘corpus’) that are required for model training within iTalk2Learn.


What are corpora?

All corpora are basically sets of data and associated meta-data. In the context of audio recordings, the data are the recordings themselves and the meta-data are the corresponding transcripts.

All meta-data have to be created by human transcriptionists in a meticulous and time-consuming process. Guidelines are developed first; then, in a multi-step quality-assurance process, transcripts are checked both automatically and through further human review.

The guidelines may indicate how exactly to transcribe mispronunciations, false starts and stutters, and how to identify slang or dialect words. Tools like XTrans (XTrans) are typically employed by transcriptionists to produce transcripts and time-alignments (indicating the timing behaviour of words, phrases and even individual sounds). Typically, audio corpora comprise several tens to thousands of hours of transcribed speech (and non-speech events, such as ringing phones in the background or breathing noises of a speaker).

[Image: iTalk2Learn corpora creation – XTrans]

The example shows the transcript of an audio file uttered by a student. A transcriptionist listens to the wav file and transcribes the uttered content according to a pre-defined set of rules. Please note that the transcription includes not only the actual words corresponding to the audio, but also breathing noises and pronunciations that deviate from the standard. All of this information has to be as exact as possible for a corpus to be usable in acoustic-model training.
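To make the idea of a time-aligned transcript concrete, here is a minimal sketch in Python. The field names and the `[breath]` noise marker are invented for illustration; they do not reflect the actual XTrans file format or the project’s transcription guidelines.

```python
from dataclasses import dataclass

# Illustrative only: field names and the "[breath]" marker are invented
# for this sketch and are not the actual XTrans format.
@dataclass
class Token:
    start: float   # start time in seconds within the audio file
    end: float     # end time in seconds
    text: str      # orthographic transcription, or a non-speech marker

# A short student utterance, with a breathing noise annotated
# alongside the actual words, as transcription guidelines may require.
segment = [
    Token(0.00, 0.35, "[breath]"),
    Token(0.35, 0.80, "three"),
    Token(0.80, 1.10, "plus"),
    Token(1.10, 1.55, "four"),
]

def duration(tokens):
    """Total time spanned by a list of time-aligned tokens."""
    return tokens[-1].end - tokens[0].start

def words_only(tokens):
    """Drop non-speech markers such as [breath], e.g. for language-model text."""
    return [t.text for t in tokens if not t.text.startswith("[")]

print(duration(segment))    # 1.55
print(words_only(segment))  # ['three', 'plus', 'four']
```

The key point the sketch illustrates is that acoustic-model training needs both layers: the exact timing of each event in the audio, and the distinction between actual speech and non-speech events.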

How are corpora usually used?

In the area of broadcast-news transcription – and due to a long series of DARPA-funded research projects in the security domain – large corpora exist for a variety of languages, dialects and audio-settings. Agencies like the Linguistic Data Consortium (LDC) or the Evaluations and Language Resources Distribution Agency (ELDA) aim to market existing corpora for scientific as well as commercial use. Both agencies provide extensive catalogues of speech and language resources.

How are corpora used in iTalk2Learn?

iTalk2Learn deals with the speech of English and German speaking children in the context of maths tutoring in classroom environments. Only very limited data matching the requirements of our project has been made available in recent years through public agencies (like the ones mentioned above). ELDA does not provide any corpora adequate for iTalk2Learn purposes.

The LDC provides only the CMU Kids corpus (CMU Kids), containing roughly 9 hours of speech from North American children in the approximate age-bracket considered by iTalk2Learn. The Speech Ark provides a corpus (created within the PF Star project) of (British) English speaking children on demand (SpeechArk). The University of Erlangen agreed to release a previously unpublished portion of German PF Star specifically for iTalk2Learn when contacted by Sail (for an overview of the PF Star corpus see Batliner et al., 2005).

What problems does this present?

However, all of these resources span only a minimal amount of material (on the order of tens of hours) and are not optimal for model training. Due to this scarcity of publicly available data, a corpus-collection effort for both languages – British English and German – was launched within iTalk2Learn. The amount of usable training data coming out of this effort will likewise be relatively small, due to the limited number of available students and resources.

However, given the very limited amount of currently publicly available data, we hope to be able to add a useful set of resources for the training and adaptation of acoustic models for automatic speech recognition of children’s voices – not only for iTalk2Learn but, through the aim of publishing the generated corpora, also for further scientific and commercial activities.

If you would like to find out more about the use of corpora within iTalk2Learn, you can get in touch with the consortium any time.


Batliner, A., Blomberg, M., D’Arcy, S., Elenius, D., Giuliani, D., Gerosa, M., Hacker, Ch., Russell, M., Steidl, S., Wong, M.: The PF STAR Children’s Speech Corpus. Interspeech 2005.


XTrans: https://www.ldc.upenn.edu/language-resources/tools/xtrans (accessed 2014/07/30)
LDC: https://www.ldc.upenn.edu/ (accessed 2014/07/30)
ELDA: http://www.elda.org/ (accessed 2014/07/30)
CMU Kids: https://catalog.ldc.upenn.edu/LDC97S63 (accessed 2014/07/30)
Speech Ark: http://www.thespeechark.com/pf-star-page.html (accessed 2014/07/30)