Control of prosodic focus in corpus-based generation of fundamental frequency contours based on the generation process model

Author

Hirose, Keikichi ; Ochi, Keiko ; Minematsu, Nobuaki

Author_Institution

Dept. of Inf. & Commun. Eng., Univ. of Tokyo, Tokyo, Japan

fYear

2010

fDate

24-28 Oct. 2010

Firstpage

629

Lastpage

632

Abstract

HMM-based speech synthesis is known to be a possible solution for realizing "flexibility" in speech synthesis. However, its frame-by-frame processing of acoustic features is not appropriate for prosodic features, which cover a wider time span than segmental features and should be handled differently. From this point of view, a method has been developed for generating sentence F₀ contours based on the generation process model, which represents a sentence F₀ contour on a logarithmic scale as the superposition of phrase and accent components. These components are further represented as responses to discrete commands, which are closely related to the linguistic and para-/non-linguistic information of sentences. By predicting the model commands instead of frame-by-frame F₀ values, flexible and robust F₀ control can be realized. As an example of flexible control, a method is developed for generating sentence F₀ contours of Japanese when focus is placed on one of the “bunsetsu” (Japanese phrasal units) of an utterance. The method first predicts the differences in the F₀ model commands between utterances with and without focus, and then applies them to the F₀ model commands predicted beforehand by the baseline method without focus assignment. The baseline method is trained on a large corpus, whereas the corpus for training the command differences can be small and need not be uttered by the same speaker as the large corpus. The validity of the method was demonstrated through experiments on F₀ contour generation and speech synthesis, including interpolation/extrapolation of the F₀ model commands for focus level control.
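The superposition described in the abstract can be illustrated with a minimal sketch of the standard Fujisaki-style formulation of the generation process model: ln F₀(t) = ln Fb + Σ Ap·Gp(t − T0) + Σ Aa·[Ga(t − T1) − Ga(t − T2)]. The constants ALPHA, BETA and GAMMA below are typical literature values assumed for illustration, not values reported in this paper. The `apply_focus` helper is hypothetical: it simply adds weighted command differences to baseline accent amplitudes (a weight between 0 and 1 interpolates, a weight above 1 extrapolates), and it ignores phrase-command differences for simplicity. It is not the authors' prediction method.

```python
import math
from dataclasses import dataclass, replace
from typing import List

# Assumed typical constants for the generation process model (not from the paper).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling level of the accent component


@dataclass
class PhraseCommand:
    t0: float  # onset time of the phrase command (s)
    ap: float  # phrase command magnitude


@dataclass
class AccentCommand:
    t1: float  # onset time of the accent command (s)
    t2: float  # offset time of the accent command (s)
    aa: float  # accent command amplitude


def gp(t: float) -> float:
    """Impulse response of the phrase control mechanism."""
    return ALPHA * ALPHA * t * math.exp(-ALPHA * t) if t >= 0 else 0.0


def ga(t: float) -> float:
    """Step response of the accent control mechanism, clipped at GAMMA."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + BETA * t) * math.exp(-BETA * t), GAMMA)


def log_f0(t: float, fb: float, phrases: List[PhraseCommand],
           accents: List[AccentCommand]) -> float:
    """ln F0(t): base value plus superposed phrase and accent components."""
    value = math.log(fb)
    value += sum(p.ap * gp(t - p.t0) for p in phrases)
    value += sum(a.aa * (ga(t - a.t1) - ga(t - a.t2)) for a in accents)
    return value


def apply_focus(accents: List[AccentCommand], deltas: List[float],
                weight: float) -> List[AccentCommand]:
    """Hypothetical focus control: add weighted amplitude differences
    to baseline accent commands (weight > 1 extrapolates)."""
    return [replace(a, aa=a.aa + weight * d) for a, d in zip(accents, deltas)]


# Usage with made-up commands: focus on the second accentual phrase.
phrases = [PhraseCommand(t0=-0.2, ap=0.5)]
baseline = [AccentCommand(0.1, 0.4, 0.3), AccentCommand(0.6, 0.9, 0.25)]
deltas = [0.0, 0.2]  # hypothetical predicted command differences
focused = apply_focus(baseline, deltas, weight=1.0)
f0_base = math.exp(log_f0(0.75, 120.0, phrases, baseline))
f0_focus = math.exp(log_f0(0.75, 120.0, phrases, focused))
```

Working with a handful of commands rather than per-frame F₀ values is what makes this kind of scaling straightforward: changing one amplitude reshapes a whole accent component smoothly.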

Keywords

acoustic focusing; extrapolation; hidden Markov models; interpolation; speech synthesis; HMM-based speech synthesis; accent components; acoustic features; contour generation; corpus-based generation; discrete commands; focus assignment; focus level control; frame-by-frame process; fundamental frequency contours; generation process model; logarithmic scale; model commands; prosodic features; prosodic focus; segmental features; training command differences; Pragmatics; Predictive models; Process control; Speech; Training; Corpus-based method; F0 contour

fLanguage

English

Publisher

IEEE

Conference_Titel

Signal Processing (ICSP), 2010 IEEE 10th International Conference on

Conference_Location

Beijing

Print_ISBN

978-1-4244-5897-4

Type

conf

DOI

10.1109/ICOSP.2010.5656835

Filename

5656835