Towards a multilingual prosody model for text-to-speech

Author

Jokisch, Oliver ; Ding, Hongwei ; Kruschke, Hans

Author_Institution

Dresden University of Technology, Laboratory of Acoustics and Speech Communication, 01062, Germany

Volume

1

fYear

2002

fDate

13-17 May 2002

Abstract

The generation of prosodic parameters such as F0 contour, duration and intensity remains an important issue for natural-sounding text-to-speech (TTS), although recently developed TTS systems have achieved considerable progress. Several suitable but language-specific rule-based, statistical or data-driven prosody models have been successfully realized in many systems. These language- and parameter-dependent models lead to a more complex and less efficient TTS system design. In earlier work, the authors proposed a hybrid data-driven and rule-based model that can adapt to different voices or speaking styles by learning and predicting prosodic parameters. The current paper discusses the multilingual generalization of this model and the design of appropriate prosodic databases. Two languages, German and Mandarin Chinese, are examined as examples. Prediction results and perceptual evaluations of F0 contours and duration values are presented. Since the perceptual results for both languages are comparable and quite satisfactory, the model is suitable for multilingual prosody control. Some resynthesis stimuli generated from modified prosodic parameters achieve near-natural mean opinion scores (MOS) above 4.0. The hybrid data-driven and rule-based model is comparatively simple and enables multilingual prosody control in TTS.

Keywords

Shape; Speech; Training;

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on

Conference_Location

Orlando, FL, USA

ISSN

1520-6149

Print_ISBN

0-7803-7402-9

Type

conf

DOI

10.1109/ICASSP.2002.5743744

Filename

5743744