Regional Information Center for Science and Technology (RICEST) - An RNN-based prosodic information synthesizer for Mandarin text-to-speech

DocumentCode :

1365278

Title :

An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Author :

Chen, Sin-Horng ; Hwang, Shaw-Hwa ; Wang, Yih-Ru

Author_Institution :

Dept. of Eng., Nat. Chiao Tung Univ., Hsinchu, Taiwan

Volume :

Issue :

Year :

1998

Date :

5/1/1998

Firstpage :

226

Lastpage :

239

Abstract :

A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, as well as intersyllable pause durations. The input layer and first hidden layer operate with a word-synchronized clock to represent current-word phonologic states within the prosodic structure of the text to be synthesized. The second hidden layer and output layer operate on a syllable-synchronized clock and use outputs from the preceding layers, along with additional syllable-level inputs fed directly to the second hidden layer, to generate the desired prosodic parameters. The RNN was trained on a large set of actual utterances accompanied by associated texts, and can automatically learn many human-prosody phonologic rules, including the well-known Sandhi Tone 3 F0-change rule. Experimental results show that all synthesized prosodic parameter sequences matched quite well with their original counterparts, and a pitch-synchronous-overlap-add-based (PSOLA-based) Mandarin TTS system was also used to test the approach. While subjective tests are difficult to perform and remain to be done in the future, informal listening tests were carried out with a significant number of native Chinese speakers, and the results confirmed that all synthesized speech sounded quite natural.
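
The two-clock structure described in the abstract can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the class name TwoClockRNN, the layer sizes, the tanh activations, the placement of recurrent connections in both hidden layers, the untrained random weights, and the choice of 8 outputs per syllable are all assumptions made for illustration. The sketch only shows the first hidden layer updating once per word and the second hidden layer updating once per syllable, with syllable-level inputs fed directly to the second hidden layer.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained; for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def tanh_sum(*vectors):
    """Element-wise tanh of the sum of equally sized vectors."""
    return [math.tanh(sum(parts)) for parts in zip(*vectors)]


class TwoClockRNN:
    """Hypothetical four-layer network with word- and syllable-rate updates."""

    def __init__(self, word_dim, syl_dim, hidden1, hidden2, out_dim, seed=0):
        rng = random.Random(seed)
        self.hidden1 = hidden1
        self.hidden2 = hidden2
        self.w_in = make_matrix(hidden1, word_dim, rng)    # input -> hidden 1
        self.w_rec1 = make_matrix(hidden1, hidden1, rng)   # hidden 1 recurrence (assumed)
        self.w_12 = make_matrix(hidden2, hidden1, rng)     # hidden 1 -> hidden 2
        self.w_syl = make_matrix(hidden2, syl_dim, rng)    # syllable inputs -> hidden 2
        self.w_rec2 = make_matrix(hidden2, hidden2, rng)   # hidden 2 recurrence (assumed)
        self.w_out = make_matrix(out_dim, hidden2, rng)    # hidden 2 -> output

    def forward(self, words):
        """words: list of (word_features, [syllable_features, ...]) pairs."""
        h1 = [0.0] * self.hidden1
        h2 = [0.0] * self.hidden2
        outputs = []
        for word_features, syllables in words:
            # Word-synchronized clock: one state update per word.
            h1 = tanh_sum(matvec(self.w_in, word_features),
                          matvec(self.w_rec1, h1))
            for syl_features in syllables:
                # Syllable-synchronized clock: one update and one output per syllable.
                h2 = tanh_sum(matvec(self.w_12, h1),
                              matvec(self.w_syl, syl_features),
                              matvec(self.w_rec2, h2))
                outputs.append(matvec(self.w_out, h2))
        return outputs


# Usage: two words (two syllables, then one syllable) with made-up feature values.
net = TwoClockRNN(word_dim=3, syl_dim=2, hidden1=4, hidden2=5, out_dim=8)
utterance = [
    ([1.0, 0.0, 0.0], [[0.1, 0.2], [0.3, 0.4]]),
    ([0.0, 1.0, 0.0], [[0.5, 0.6]]),
]
prosody = net.forward(utterance)
print(len(prosody), len(prosody[0]))
```

Each output vector here stands in for one syllable's prosodic parameters (pitch contour, energy level, initial and final durations, and the following pause); how those quantities are actually encoded and how the network is trained are not specified in this record.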

Keywords :

multilayer perceptrons; recurrent neural nets; speech synthesis; Mandarin Chinese; Mandarin text-to-speech; PSOLA-based Mandarin TTS system; RNN-based prosodic information synthesizer; Sandhi Tone 3 F0-change rule; current-word phonologic states; first hidden layer; four-layer recurrent neural network; human-prosody phonologic rules; informal listening tests; input layer; intersyllable pause durations; output layer; pitch-synchronous-overlap-add-based Mandarin TTS system; prosodic structure; second hidden layer; syllable energy levels; syllable final duration; syllable initial duration; syllable pitch contours; syllable-synchronized clock; synthesized prosodic parameter sequences; synthesized speech; utterances; word-synchronized clock; Acoustic testing; Clocks; Energy states; Loudspeakers; Network synthesis; Performance evaluation; Recurrent neural networks; Speech synthesis; Synthesizers; System testing;

Language :

English

Journal_Title :

Speech and Audio Processing, IEEE Transactions on

Publisher :

IEEE

ISSN :

1063-6676

Type :

jour

DOI :

10.1109/89.668817

Filename :

668817

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1365278