DocumentCode :
1693740
Title :
Lightly supervised GMM VAD to use audiobook for speech synthesiser
Author :
Mamiya, Yoshitaka ; Yamagishi, Junichi ; Watts, Oliver ; Clark, Robert A. J. ; King, Simon ; Stan, Andrei
Author_Institution :
Centre for Speech Technol. Res., Univ. of Edinburgh, Edinburgh, UK
fYear :
2013
Firstpage :
7987
Lastpage :
7991
Abstract :
Audiobooks have attracted attention as promising data for training Text-to-Speech (TTS) systems. However, their audio and text are usually not aligned with each other, and the audio is usually divided only into chapter units. In practice, the audio and text must be aligned before they can be used to build TTS synthesisers. However, aligning audio and text data is time-consuming, involves manual labor, and requires people skilled in speech processing. Previously, we proposed using graphemes to align speech and text data automatically. This paper further integrates a lightly supervised voice activity detection (VAD) technique to detect sentence boundaries as a pre-processing step before the grapheme approach. This lightly supervised technique requires time stamps of speech and silence only for the first fifty sentences. Combining the two, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. Using subjective evaluations, we analyse how the grapheme-based aligner and/or the proposed VAD technique affect the quality of HMM-based speech synthesisers trained on audiobooks.
Keywords :
hidden Markov models; signal detection; speech synthesis; HMM-based speech synthesisers; VAD technique; audiobook; grapheme-based aligner approach; lightly supervised GMM VAD; lightly supervised voice activity detection technique; minimum manual intervention; semiautomatically build TTS systems; sentence boundary detection; speech processing; text data; text-to-speech system training; Buildings; Hidden Markov models; Manuals; Speech; Speech synthesis; Synthesizers; HMM-based speech synthesis; audiobook; lightly supervised; voice activity detection;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Conference_Location :
Vancouver, BC
ISSN :
1520-6149
Type :
conf
DOI :
10.1109/ICASSP.2013.6639220
Filename :
6639220