Title :
Using prompts to produce quality corpus for training automatic speech recognition systems
Author :
Lecouteux, Benjamin ; Linarès, Georges
Author_Institution :
Lab. Inf. d'Avignon (LIA), Univ. of Avignon, Avignon
Abstract :
In this paper we present an integrated unsupervised method for producing a quality corpus for training automatic speech recognition (ASR) systems using prompts or closed captions. Closed captions and prompts do not always have timestamps and do not necessarily correspond to the exact speech. We propose a method for extracting a quality corpus from imperfect transcripts. The proposed approach works in two steps. First, during the search, the ASR system finds matching segments in a large prompt database. Second, the matching segments are used inside a driven decoding algorithm (DDA) to produce a high-quality corpus. Results show an F-measure of 96% in terms of spotting, while the DDA corrects the output according to the prompts, so a high-quality corpus is easily extracted.
Keywords :
decoding; feature extraction; speech coding; speech recognition; unsupervised learning; automatic speech recognition systems; driven decoding algorithm; high quality corpus extraction; integrated unsupervised method; Abstracts; Automatic speech recognition; Costs; Databases; Error analysis; Guidelines; Machine assisted indexing; Transducers; automatic segmentation; closed captioning; corpus building
Conference_Title :
Electrotechnical Conference, 2008. MELECON 2008. The 14th IEEE Mediterranean
Conference_Location :
Ajaccio
Print_ISBN :
978-1-4244-1632-5
Electronic_ISBN :
978-1-4244-1633-2
DOI :
10.1109/MELCON.2008.4618540