  • DocumentCode
    394198
  • Title
    Frame-dependent multi-stream reliability indicators for audio-visual speech recognition
  • Author
    Garg, Ashutosh ; Potamianos, Gerasimos ; Neti, Chalapathy ; Huang, Thomas S.
  • Author_Institution
    Beckman Inst., Univ. of Illinois, Urbana, IL, USA
  • Volume
    1
  • fYear
    2003
  • fDate
    6-10 April 2003
  • Abstract
    We investigate the use of local, frame-dependent reliability indicators of the audio and visual modalities as a means of estimating the stream exponents of multi-stream hidden Markov models for audio-visual automatic speech recognition. We consider two such indicators for each modality, defined as functions of the speech-class conditional observation probabilities of appropriate audio- or visual-only classifiers. We subsequently map the four reliability indicators into the stream exponents of a state-synchronous, two-stream hidden Markov model, as a sigmoid function of their linear combination. We propose two algorithms to estimate the sigmoid weights, based on the maximum conditional likelihood and minimum classification error criteria. We demonstrate the superiority of the proposed approach on a connected-digit audio-visual speech recognition task, under varying audio channel noise conditions. The estimated, frame-dependent stream exponents yield a significantly smaller word error rate than global stream exponents. They also outperform utterance-level exponents, even though the latter utilize a priori knowledge of the utterance noise level.
  • Keywords
    audio signal processing; audio-visual systems; hidden Markov models; maximum likelihood estimation; noise; probability; signal classification; speech recognition; video signal processing; HMM; audio channel noise conditions; audio-only classifiers; audio-visual speech recognition; connected-digit audio-visual speech recognition; frame-dependent multi-stream reliability indicators; global stream exponents; local reliability indicators; maximum conditional likelihood classification error; minimum classification error; multi-stream hidden Markov models; sigmoid function; sigmoid weights estimation; speech-class conditional observation probabilities; state-synchronous hidden Markov model; stream exponents estimation; utterance noise level; utterance-level exponents; visual-only classifiers; word error rate; Automatic speech recognition; Degradation; Hidden Markov models; Humans; Neural networks; Noise level; Robustness; Speech recognition; State estimation; Streaming media;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-7663-3
  • Type
    conf
  • DOI
    10.1109/ICASSP.2003.1198707
  • Filename
    1198707
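
The mapping described in the abstract, from four reliability indicators to a stream exponent through a sigmoid of their linear combination, can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the bias term, the example values, and the convention that the visual exponent equals one minus the audio exponent are all assumptions made for this example; the record itself does not specify them.

```python
import math


def stream_exponent(indicators, weights, bias=0.0):
    """Map reliability indicators to an audio stream exponent in [0, 1].

    Computes sigmoid(bias + sum_i weights[i] * indicators[i]).
    The bias term is an assumption of this sketch.
    """
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have the same length")
    z = bias + sum(w * d for w, d in zip(weights, indicators))
    # Branch on the sign of z so math.exp never receives a large
    # positive argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def combined_log_likelihood(log_b_audio, log_b_visual, audio_exponent):
    """Combine per-frame, per-state stream log-likelihoods.

    Assumes (hypothetically) that the two exponents sum to one, so the
    visual stream is weighted by 1 - audio_exponent.
    """
    return (audio_exponent * log_b_audio
            + (1.0 - audio_exponent) * log_b_visual)


if __name__ == "__main__":
    # Hypothetical values: two audio and two visual indicators for one frame.
    frame_indicators = [1.2, 0.8, 0.3, 0.1]
    sigmoid_weights = [0.5, 0.5, -0.5, -0.5]
    lam = stream_exponent(frame_indicators, sigmoid_weights)
    print(lam, combined_log_likelihood(-2.0, -4.0, lam))
```

With all indicators at zero and no bias, the sketch returns an exponent of 0.5, which weights the two streams equally. In the paper the sigmoid weights are estimated under the maximum conditional likelihood or minimum classification error criteria; that estimation is not shown here.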