• DocumentCode
    3102377
  • Title
    Characteristics and spectral features used in automatic prediction of vowel duration in spontaneous speech
  • Author
    Beke, Andras ; Gosy, M.
  • Author_Institution
    MTA Research Institute for Linguistics of the Hungarian Academy of Sciences/Phonetics, Budapest, Hungary
  • fYear
    2012
  • fDate
    2-5 Dec. 2012
  • Firstpage
    65
  • Lastpage
    70
  • Abstract
    Many studies in phonetics and phonology have analyzed segmental duration, asking which factors and which interactions between factors determine it. Their results often play an important role in language technology applications widely used in infocommunication, for example text-to-speech synthesis (TTS) and automatic speech recognition (ASR). Speech sound duration depends on various factors such as phonetic quality, phonological context, phonological position in the word or in the utterance, speech style, etc. We set out to predict vowel duration in spontaneous speech automatically, using three methods. (i) A classification and regression tree (CART) using characteristic features of vowel quality and context. (ii) A feedforward neural network (FFNN) using the same features. (iii) An FFNN using a combination of the characteristic features and spectral features. Empirical durational data were obtained by measuring vowel durations in over 110 minutes of a large Hungarian spontaneous speech database (BEA). With CART, the correlation between measured and predicted vowel duration was poor (0.57), with an average root mean square error (RMSE) of approximately 33 ms. With FFNN the results were slightly better: the correlation between target and predicted vowel duration was 0.62 and the RMSE was about 29 ms. With the combined features the results improved further: the correlation between target and predicted vowel duration was 0.79 and the RMSE was 25 ms. The results obtained for Hungarian, on the one hand, support the complexity of the features affecting vowel duration and, on the other, indicate the temporal complexity of the segmental level of spontaneous speech, as has already been reported for Lithuanian, Czech, Hindi, Telugu and Korean.
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cognitive Infocommunications (CogInfoCom), 2012 IEEE 3rd International Conference on
  • Conference_Location
    Kosice, Slovakia
  • Print_ISBN
    978-1-4673-5187-4
  • Electronic_ISBN
    978-1-4673-5186-7
  • Type
    conf
  • DOI
    10.1109/CogInfoCom.2012.6421951
  • Filename
    6421951
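
The abstract reports two evaluation measures for each method: the correlation between target and predicted vowel duration, and the RMSE in milliseconds. The following is a minimal sketch of how such measures are commonly computed. It assumes the correlation is Pearson's r, which the record does not state. The function names and the toy durations are hypothetical and are not taken from the paper or from the BEA database.

```python
import math


def rmse(target, predicted):
    """Root mean square error between two equal-length sequences (same unit as the input, e.g. ms)."""
    if len(target) != len(predicted) or len(target) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(target, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson_r(target, predicted):
    """Pearson correlation coefficient (assumed here; the record does not name the coefficient)."""
    n = len(target)
    if n != len(predicted) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_t = sum(target) / n
    mean_p = sum(predicted) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(target, predicted))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    # Hypothetical vowel durations in milliseconds, for illustration only.
    measured_ms = [62.0, 85.0, 110.0, 74.0, 140.0]
    predicted_ms = [70.0, 80.0, 95.0, 88.0, 120.0]
    print("RMSE (ms):", rmse(measured_ms, predicted_ms))
    print("Pearson r:", pearson_r(measured_ms, predicted_ms))
```

Both functions take the measured and the predicted durations as parallel sequences, so the same pair of calls could be applied to the output of any of the three methods in order to compare them on equal terms.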