  • DocumentCode
    2150563
  • Title
    Improving acoustic event detection using generalizable visual features and multi-modality modeling
  • Author
    Huang, Po-Sen; Zhuang, Xiaodan; Hasegawa-Johnson, Mark
  • Author_Institution
    ECE Dept., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2011
  • fDate
    22-27 May 2011
  • Firstpage
    349
  • Lastpage
    352
  • Abstract
    Acoustic event detection (AED) aims to identify both the timestamps and the types of multiple events, and has been found to be very challenging. The cues for these events often exist in both audio and vision, but not necessarily in a synchronized fashion. We study improving the detection and classification of the events using cues from both modalities. We propose optical-flow-based spatial pyramid histograms as a generalizable visual representation that does not require training on labeled video data. Hidden Markov models (HMMs) are used for audio-only modeling, and multi-stream HMMs or coupled HMMs (CHMMs) are used for audio-visual joint modeling. To allow for audio-visual state asynchrony, we explore effective CHMM training via HMM state-space mapping, parameter tying, and different initialization schemes. The proposed methods improve acoustic event classification and detection on a multimedia meeting room dataset containing eleven types of general non-speech events, without using any data resources beyond the video stream accompanying the audio observations. Our systems perform favorably compared to previously reported systems that leverage ad hoc visual cue detectors and localization information obtained from multiple microphones.
  • Keywords
    acoustic signal detection; audio signal processing; audio streaming; audio-visual systems; hidden Markov models; image sequences; video streaming; CHMM training; HMM state-space mapping; acoustic event classification; acoustic event detection; audio-only modeling; audio-visual joint modeling; audio-visual state asynchrony flexibility; coupled HMM; generalizable visual features; generalizable visual representation; hidden Markov model; multimedia meeting room dataset; multimodality modeling; multiple microphone; multistream HMM; nonspeech events; optical flow based spatial pyramid histogram; parameter tying; timestamps; Acoustics; Event detection; Feature extraction; Hidden Markov models; Histograms; Speech; Visualization; acoustic event detection; coupled hidden Markov models; hidden Markov models; multi-stream HMM; optical flow;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
  • Conference_Location
    Prague
  • ISSN
    1520-6149
  • Print_ISBN
    978-1-4577-0538-0
  • Electronic_ISBN
    1520-6149
  • Type
    conf
  • DOI
    10.1109/ICASSP.2011.5946412
  • Filename
    5946412