  • DocumentCode
    1010045
  • Title
    State-dependent phonetic tied mixtures with pronunciation modeling for spontaneous speech recognition
  • Author
    Liu, Yi; Fung, Pascale
  • Author_Institution
    Dept. of Electr. & Electron. Eng., Hong Kong Univ. of Sci. & Technol., China
  • Volume
    12
  • Issue
    4
  • fYear
    2004
  • fDate
    7/1/2004
  • Firstpage
    351
  • Lastpage
    364
  • Abstract
    We propose a method of incorporating pronunciation modeling into acoustic models with high discriminative power and low complexity to improve spontaneous speech recognition accuracy. Spontaneous speech contains more phonetic and acoustic confusions because of the greater degree of pronunciation variation caused by speaking rate, speaker style, speaking mode, speaker accent, and similar factors. With general data-driven complexity-reduction methods that do not explicitly model pronunciation variation, the acoustic model is not robust enough to capture the flexible phonetic confusions and pronunciation variants of spontaneous speech. We propose a state-dependent phonetic tied-mixture (PTM) model with variable codebook size to improve the coverage of phonetic variations while maintaining the model's discriminative ability. Our state-dependent PTM model incorporates a state-level pronunciation model to better discriminate phonetic and acoustic confusions while reducing model complexity. Experimental results on the spontaneous speech portion of Mandarin Broadcast News show that, at comparable model complexity, our model outperforms state tying and mixture tying models by 2.46% and 3.51% absolute syllable error rate reduction, respectively. After Gaussian sharing is added to the state tying and mixture tying models, our proposed model still yields an additional 1% and 2.6% absolute syllable error rate reduction, respectively. In addition, unlike many complexity-reduction methods, our method causes no performance degradation on read speech.
  • Keywords
    Gaussian distribution; acoustic signal processing; dynamic programming; error statistics; speech coding; speech recognition; Gaussian sharing; absolute syllable error rate reduction; acoustic confusions; acoustic models; data-driven complexity-reduction methods; high discriminative power; model complexity; model discriminative ability; phonetic confusions; pronunciation modelling; speaker accent; speaker style; speaking mode; speaking rate; spontaneous speech recognition accuracy; state-dependent phonetic tied-mixture model; state-level pronunciation model; variable codebook size; Automatic speech recognition; Broadcasting; Context modeling; Degradation; Error analysis; Helium; Hidden Markov models; Loudspeakers; Robustness; Speech recognition
  • fLanguage
    English
  • Journal_Title
    Speech and Audio Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1063-6676
  • Type
    jour
  • DOI
    10.1109/TSA.2004.828638
  • Filename
    1306509