• DocumentCode
    3527255
  • Title
    Ensemble speaker and speaking environment modeling approach with advanced online estimation process
  • Author
    Tsao, Yu ; Li, Jinyu ; Lee, Chin-Hui
  • Author_Institution
    Sch. of Electr. & Comput. Eng., Georgia Inst. of Technol., Atlanta, GA
  • fYear
    2009
  • fDate
    19-24 April 2009
  • Firstpage
    3833
  • Lastpage
    3836
  • Abstract
    Recently, we proposed an ensemble speaker and speaking environment modeling (ESSEM) framework to characterize speaker variability and speaking environments. In contrast to multi-style training, ESSEM uses single-style training to prepare multiple sets of environment-specific acoustic models. The ensemble of these acoustic models forms a prior structure of the environment, which allows flexible prediction of an unknown environment during testing. In this study, we present methods to further improve the precision of model characterization. We first study a weighted N-best information technique that makes better use of the N-best transcription hypotheses in unsupervised adaptation. Next, we introduce cohort selection and environment space adaptation techniques that improve the resolution and coverage of the prior structure online. By integrating the proposed methods, we further improve ESSEM performance over our previous study. On the Aurora-2 task, ESSEM achieves an average word error rate (WER) of 4.64%, a 15.64% relative WER reduction (from 5.50% to 4.64%) over our best baseline result, which was obtained with multi-condition training.
  • Keywords
    estimation theory; hidden Markov models; speech recognition; N-best information technique; N-best transcription hypothesis; acoustic models; advanced online estimation process; automatic speech recognition; average word error rate; ensemble speaker and speaking environment modeling; hidden Markov model; multistyle training; single-style training; space adaptation techniques; Acoustic distortion; Acoustic testing; Automatic speech recognition; Error analysis; Hidden Markov models; Loudspeakers; Maximum likelihood linear regression; Phase estimation; Robustness; Stochastic processes; N-best transcription; ensemble speaker and speaking environment modeling; noise robustness;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
  • Conference_Location
    Taipei
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4244-2353-8
  • Electronic_ISBN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2009.4960463
  • Filename
    4960463