  • DocumentCode
    984469
  • Title
    Speaker association with signal-level audiovisual fusion
  • Author
    Fisher, John W., III; Darrell, Trevor
  • Author_Institution
    Comput. Sci. & Artificial Intelligence Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA
  • Volume
    6
  • Issue
    3
  • fYear
    2004
  • fDate
    6/1/2004
  • Firstpage
    406
  • Lastpage
    413
  • Abstract
    Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.
  • Keywords
    audio signal processing; image sequences; interactive systems; probability; speech recognition; statistical analysis; video signal processing; audio signals; cross-modal correspondence; discount errant motion; mutual information theoretic measure; nonparametric statistical density modeling techniques; nonspeech events; probabilistic multimodal generation model; signal-level audiovisual fusion; speaker data association; visual signals; Computer science; Databases; Fusion power generation; Microphones; Mutual information; Signal detection; Signal processing; Speech recognition; Telephone sets; Telephony; Audiovisual correspondence; multimodal data association; mutual information
  • fLanguage
    English
  • Journal_Title
    Multimedia, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type
    jour
  • DOI
    10.1109/TMM.2004.827503
  • Filename
    1298813