Synthesis of emotional speech using RP-PSOLA

Author

Vine, Daniel S G ; Sahandi, Reza

fYear

2000

fDate

2000

Firstpage

8/1

Lastpage

8/6

Abstract

Whilst TD-PSOLA remains an adequate solution for neutral speech synthesis, it is less suitable for emotional speech styles, which require more extreme pitch manipulation. By reducing the extent of the necessary pitch manipulation, distortions and artefacts introduced by TD-PSOLA could potentially be lessened. To accomplish this, a method for recording concatenative units with f₀ values similar to the target intonation has been devised. This technique, termed reference pitch prompted recording, involves a speaker recording concatenative units at a set pitch. The speaker is guided by a 'reference pitch prompt' (RPP), which is a monotonic, hummed note. In RP-PSOLA (reference pitch-PSOLA) synthesis, RPP-recorded units such as syllables are concatenated and an intonation contour applied using TD-PSOLA. RP-PSOLA can be extended so that several versions of each syllable are recorded, each at a different pitch. In this synthesis technique, termed multiple pitch RP-PSOLA, syllables are selected from an inventory to approximate to the target f₀ contour and concatenated. This paper compares the RP-PSOLA and multiple pitch RP-PSOLA synthesis methods in terms of the perceived distortion in emotional synthetic sentences, via a listening experiment. The results showed that multiple pitch RP-PSOLA is perceived to produce marginally less distorted synthetic speech than RP-PSOLA overall.
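
The unit selection step of multiple pitch RP-PSOLA can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_units`, the inventory layout, the nearest-pitch criterion, and all pitch values are assumptions made for illustration. For each syllable it picks the recorded version whose pitch is closest to the target f₀, and reports the scaling factor that a later pitch-modification stage such as TD-PSOLA would still have to apply.

```python
def select_units(target_contour, inventory):
    """Pick, per syllable, the recorded pitch nearest to the target f0.

    target_contour: list of (syllable, target_f0_hz) pairs (hypothetical format).
    inventory: dict mapping syllable -> list of recorded pitches in Hz
               (hypothetical layout, one entry per recorded version).
    Returns a list of (syllable, chosen_pitch_hz, residual_scale), where
    residual_scale = target_f0 / chosen_pitch is the remaining pitch change.
    """
    selected = []
    for syllable, target_f0 in target_contour:
        recorded_pitches = inventory[syllable]
        # Assumed criterion: smallest absolute difference in Hz.
        chosen = min(recorded_pitches, key=lambda p: abs(p - target_f0))
        selected.append((syllable, chosen, target_f0 / chosen))
    return selected


# Illustrative values only; not data from the paper.
inventory = {
    "hel": [100.0, 140.0, 180.0],
    "lo": [100.0, 140.0, 180.0],
}
target_contour = [("hel", 150.0), ("lo", 110.0)]

units = select_units(target_contour, inventory)
for syllable, pitch, scale in units:
    print(f"{syllable}: use {pitch:.0f} Hz recording, scale pitch by {scale:.3f}")
```

With a single reference pitch the residual scaling factor would be fixed by that one recording; choosing among several recorded pitches keeps the factor closer to 1, which is the reduction in pitch manipulation that the abstract describes.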

fLanguage

English

Publisher

IET

Conference_Title

State of the Art in Speech Synthesis (Ref. No. 2000/058), IEE Seminar on

Conference_Location

London

Type

conf

DOI

10.1049/ic:20000325

Filename

846964