DocumentCode
2009629
Title
Modeling prosody patterns for Chinese expressive text-to-speech synthesis
Author
Wu, Zhiyong ; Cai, Lianhong ; Meng, Helen M.
Author_Institution
Tsinghua-CUHK Joint Res. Center for Media Sci., Tsinghua Univ., Shenzhen, China
fYear
2010
fDate
Nov. 29 2010-Dec. 3 2010
Firstpage
148
Lastpage
152
Abstract
This paper proposes an approach for modeling the prosody patterns of acoustic features for Chinese expressive text-to-speech (TTS) synthesis. Based on the observation that a speaker tends to put more emphasis on one particular syllable within a multi-syllabic prosodic word, we identify this syllable as the core syllable, which can be derived from the semantic stress and tone information of the text prompt. We then classify the syllables in speech into four classes, based on their relation to the core syllable in a prosodic word. We analyze contrastive (neutral versus expressive) speech recordings for each of the four classes, and develop a perturbation model that takes the prosody pattern into account to transform neutral speech into expressive speech. Perceptual experiments on both neutral speech recordings and neutral TTS outputs, involving 13 subjects, indicate that the proposed approach can significantly enhance expressivity in synthesizing expressive speech.
Keywords
natural language processing; speaker recognition; speech synthesis; text analysis; Chinese expressive text-to-speech synthesis; acoustic features; contrastive speech recordings; multisyllabic prosodic word; neutral speech; perturbation model; prosody patterns; semantic stress; speaker; text prompt; Acoustics; Hidden Markov models; Semantics; Speech; Speech synthesis; Stress; expressive text-to-speech (TTS); non-linear perturbation model; prosody pattern;
fLanguage
English
Publisher
IEEE
Conference_Titel
Chinese Spoken Language Processing (ISCSLP), 2010 7th International Symposium on
Conference_Location
Tainan
Print_ISBN
978-1-4244-6244-5
Type
conf
DOI
10.1109/ISCSLP.2010.5684494
Filename
5684494