• DocumentCode
    1880966
  • Title
    A real-time prototype for small-vocabulary audio-visual ASR
  • Author
    Connell, J.H. ; Haas, N. ; Marcheret, E. ; Neti, C. ; Potamianos, G. ; Velipasalar, S.
  • Author_Institution
    IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA
  • Volume
    2
  • fYear
    2003
  • fDate
    6-9 July 2003
  • Abstract
    We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice™ speech recognition system. Frontal face, full frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera, and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On the average, the required computations utilize approximately 67% of a Pentium™ 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
  • Keywords
    audio-visual systems; hidden Markov models; image recognition; real-time systems; speech recognition; video signal processing; PC camera; acoustic signal; appearance-based visual features; automatic speech recognition; feature fusion technique; hidden Markov model; real-time prototype; small-vocabulary audio-visual ASR; Automatic speech recognition; Cameras; Computer architecture; Engines; Feature extraction; Hidden Markov models; Prototypes; Speech recognition; System performance; Universal Serial Bus
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on
  • Print_ISBN
    0-7803-7965-9
  • Type
    conf
  • DOI
    10.1109/ICME.2003.1221655
  • Filename
    1221655