Regional Information Center for Science and Technology (RICeST) - A speaking rate-controlled Mandarin TTS system

DocumentCode :

1687309

Title :

A speaking rate-controlled Mandarin TTS system

Author :

Chiao-Hua Hsieh ; Yih-Ru Wang ; Chen-Yu Chiang ; Sin-Horng Chen

Author_Institution :

Dept. of Electr. Eng., Nat. Chiao Tung Univ., Hsinchu, Taiwan

Year :

2013

Firstpage :

6900

Lastpage :

6904

Abstract :

In this paper, a new speaking rate-controlled Mandarin TTS system based on a speaking rate-dependent hierarchical prosodic model (SR-HPM) [6] is proposed. In the training phase, a data-driven approach is employed to build the SR-HPM automatically and directly from a large prosody-unlabeled speech database containing utterances of various speaking rates. The SR-HPM comprises 15 sub-models designed to describe the relationships among 3 types of prosodic-acoustic features of speech utterances, two types of prosodic tags specifying a 4-layer prosody hierarchy, linguistic features at various levels of the associated texts, and the speaking rates. In the test phase, the SR-HPM is employed to generate 4 prosodic-acoustic features: syllable pitch contours, syllable durations, syllable energy levels, and syllable juncture pause durations. By combining these prosodic features with the spectral features generated by the HTS synthesizer, the system can generate natural speech at any speaking rate in the wide range of 0.15-0.3 seconds/syllable. A distinct feature of the system, its ability to control the occurrence frequencies of breaks of various types as well as their pause durations according to the given speaking rate, was demonstrated. A subjective test showed that MOS scores of 3.35, 3.44 and 3.28 were achieved for fast (SR=0.17 sec/syllable), medium (SR=0.2 sec/syllable) and slow (SR=0.25 sec/syllable) synthetic speech, respectively.

Keywords :

natural language processing; speech synthesis; 4-layer prosody hierarchy; HTS synthesizer; SR-HPM; data-driven approach; linguistic features; prosodic tags; prosodic-acoustic features; prosody-unlabeled speech database; speaking rate-controlled Mandarin TTS system; speaking rate-dependent hierarchical prosodic model; speech utterances; syllable durations; syllable energy levels; syllable juncture pause durations; syllable pitch contours; text-to-speech synthesis; training phase; Databases; Energy states; Hidden Markov models; High-temperature superconductors; Pragmatics; Speech; Training; Mandarin prosody modeling; Speaking rate modeling; Speaking rate-controlled TTS;

Language :

English

Publisher :

IEEE

Conference_Title :

Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on

Conference_Location :

Vancouver, BC

ISSN :

1520-6149

Type :

conf

DOI :

10.1109/ICASSP.2013.6638999

Filename :

6638999

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1687309