DocumentCode
3102377
Title
Characteristics and spectral features used in automatic prediction of vowel duration in spontaneous speech
Author
Beke, Andras ; Gosy, M.
Author_Institution
MTA Research Institute for Linguistics of the Hungarian Academy of Sciences/Phonetics, Budapest, Hungary
fYear
2012
fDate
2-5 Dec. 2012
Firstpage
65
Lastpage
70
Abstract
Many studies in phonetics and phonology have analyzed segmental duration, asking which factors and interactions between factors determine it. Their results often play an important role in language technology applications such as text-to-speech synthesis (TTS) and automatic speech recognition (ASR), which are widely used in infocommunication. Speech sound duration depends on various factors such as phonetic quality, phonological context, phonological position in the word or in the utterance, speech style, etc. We aimed to predict vowel duration in spontaneous speech automatically using three methods. (i) A classification and regression tree (CART) using characteristic features of vowel quality and context. (ii) A feedforward neural network (FFNN) using the same features. (iii) An FFNN using a combination of the characteristic features and spectral features. Empirical durational data were obtained by measuring vowel durations in over 110 minutes of speech from a large Hungarian spontaneous speech database (BEA). With CART, the correlation between measured and predicted vowel duration was poor (0.57), with an average root mean square error (RMSE) of approximately 33 ms. With FFNN the results were slightly better: the correlation between target and predicted vowel duration was 0.62 and the RMSE was about 29 ms. With the combined features the results improved further: the correlation between target and predicted vowel duration was 0.79 and the RMSE was 25 ms. On the one hand, the results obtained for Hungarian confirm the complexity of the features affecting vowel duration; on the other, they indicate the temporal complexity of the segmental level of spontaneous speech, as has already been reported for Lithuanian, Czech, Hindi, Telugu and Korean.
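The two evaluation measures named in the abstract, the correlation between measured and predicted durations and the RMSE, can be illustrated with a minimal sketch. It assumes the correlation is Pearson's r (the abstract does not say which coefficient was used), and the function names and duration values are hypothetical, invented purely for illustration; they are not data from the paper.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between two equal-length sequences (same unit, e.g. ms)."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(measured, predicted):
    """Pearson correlation coefficient (assumed here; the abstract only says 'correlation')."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_m * var_p)


# Hypothetical vowel durations in milliseconds, not taken from the BEA database.
measured_ms = [60.0, 85.0, 110.0, 75.0, 140.0]
predicted_ms = [70.0, 80.0, 100.0, 90.0, 120.0]

print(f"RMSE = {rmse(measured_ms, predicted_ms):.1f} ms")
print(f"r    = {pearson_r(measured_ms, predicted_ms):.2f}")
```

A lower RMSE and a higher correlation both indicate predictions closer to the measured durations, which is the sense in which the abstract ranks the three methods.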
fLanguage
English
Publisher
IEEE
Conference_Titel
Cognitive Infocommunications (CogInfoCom), 2012 IEEE 3rd International Conference on
Conference_Location
Kosice, Slovakia
Print_ISBN
978-1-4673-5187-4
Electronic_ISBN
978-1-4673-5186-7
Type
conf
DOI
10.1109/CogInfoCom.2012.6421951
Filename
6421951