DocumentCode
984469
Title
Speaker association with signal-level audiovisual fusion
Author
Fisher, John W., III; Darrell, Trevor
Author_Institution
Comput. Sci. & Artificial Intelligence Lab., Massachusetts Inst. of Technol., Cambridge, MA, USA
Volume
6
Issue
3
fYear
2004
fDate
6/1/2004 12:00:00 AM
Firstpage
406
Lastpage
413
Abstract
Audio and visual signals arriving from a common source are detected using a signal-level fusion technique. A probabilistic multimodal generation model is introduced and used to derive an information theoretic measure of cross-modal correspondence. Nonparametric statistical density modeling techniques can characterize the mutual information between signals from different domains. By comparing the mutual information between different pairs of signals, it is possible to identify which person is speaking a given utterance and discount errant motion or audio from other utterances or nonspeech events.
Keywords
audio signal processing; image sequences; interactive systems; probability; speech recognition; statistical analysis; video signal processing; audio signals; cross-modal correspondence; discount errant motion; mutual information theoretic measure; nonparametric statistical density modeling techniques; nonspeech events; probabilistic multimodal generation model; signal-level audiovisual fusion; speaker data association; visual signals; Computer science; Databases; Fusion power generation; Microphones; Mutual information; Signal detection; Signal processing; Speech recognition; Telephone sets; Telephony; Audiovisual correspondence; multimodal data association; mutual information;
fLanguage
English
Journal_Title
Multimedia, IEEE Transactions on
Publisher
IEEE
ISSN
1520-9210
Type
jour
DOI
10.1109/TMM.2004.827503
Filename
1298813