Characteristics and spectral features used in automatic prediction of vowel duration in spontaneous speech

Author

Beke, Andras ; Gosy, M.

Author_Institution

MTA Research Institute for Linguistics of the Hungarian Academy of Sciences/Phonetics, Budapest, Hungary

fYear

2012

fDate

2-5 Dec. 2012

Firstpage

65

Lastpage

70

Abstract

Many studies in phonetics and phonology have analyzed segmental duration, asking which factors, and which interactions between factors, determine it. Their results often play an important role in language technology applications widely used in infocommunication, such as text-to-speech synthesis (TTS) and automatic speech recognition (ASR). Speech sound duration depends on various factors such as phonetic quality, phonological context, phonological position in the word or in the utterance, speech style, etc. We aimed to predict vowel duration in spontaneous speech automatically, using three methods. (i) A classification and regression tree (CART) using characteristic features of vowel quality and context. (ii) A feedforward neural network (FFNN) using the same features. (iii) An FFNN using a combination of the characteristic features and spectral features. Empirical durational data were obtained by measuring vowel durations in over 110 minutes of speech from a large Hungarian spontaneous speech database (BEA). With CART, the correlation between measured and predicted vowel duration was poor (0.57), with an average root mean square error (RMSE) of approximately 33 ms. With the FFNN the results were slightly better: the correlation between target and predicted vowel duration was 0.62 and the RMSE was about 29 ms. With the combined features the results improved further: the correlation between target and predicted vowel duration was 0.79 and the RMSE was 25 ms. The results obtained for Hungarian confirm that a complex set of features affects vowel duration, and they also indicate the temporal complexity of the segmental level of spontaneous speech, as has already been reported for Lithuanian, Czech, Hindi, Telugu and Korean.
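
The abstract compares the three methods by the correlation and the RMSE between measured and predicted vowel durations. As a minimal sketch of how these two figures can be computed, the Python below uses only the standard library. It assumes the correlation is Pearson's r, which the abstract does not state. The function names and the toy durations are hypothetical and are not taken from the paper or from the BEA database.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between two equal-length lists (same unit, e.g. ms)."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(measured, predicted):
    """Pearson correlation coefficient (assumed here; the abstract says only 'correlation')."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    # Sums of products of deviations from the means
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_m * var_p)


# Hypothetical toy data in milliseconds, for illustration only
measured_ms = [60.0, 80.0, 100.0, 120.0]
predicted_ms = [70.0, 70.0, 110.0, 110.0]

print(f"RMSE: {rmse(measured_ms, predicted_ms):.1f} ms")
print(f"Pearson r: {pearson_r(measured_ms, predicted_ms):.2f}")
```

The two measures complement each other: RMSE reports the typical size of the prediction error in milliseconds, while the correlation reports how well the predictions track the ordering and spread of the measured durations regardless of scale.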

fLanguage

English

Publisher

IEEE

Conference_Titel

Cognitive Infocommunications (CogInfoCom), 2012 IEEE 3rd International Conference on

Conference_Location

Kosice, Slovakia

Print_ISBN

978-1-4673-5187-4

Electronic_ISBN

978-1-4673-5186-7

Type

conf

DOI

10.1109/CogInfoCom.2012.6421951

Filename

6421951