Unsupervised detection of multimodal clusters in edited recordings

Author

Dielmann, Alfred

Author_Institution

IDIAP Res. Inst., Martigny, Switzerland

fYear

2010

fDate

4-6 Oct. 2010

Firstpage

177

Lastpage

182

Abstract

Edited video recordings, such as talk shows and sitcoms, often include Audio-Visual clusters: frequent repetitions of closely related acoustic and visual content. For example, during a political debate, whenever a given participant holds the conversational floor, her/his voice tends to co-occur with camera views (i.e. shots) showing her/his portrait. Unlike previous Audio-Visual clustering work, this paper proposes an unsupervised approach that detects Audio-Visual clusters without making assumptions about the recording content, such as the presence of specific participant voices or faces. Sequences of audio and shot clusters are automatically identified using unsupervised audio diarization and shot segmentation techniques. Audio-Visual clusters are then formed by ranking the co-occurrences between these two segmentations and selecting those that occur significantly more often than chance. Numerical experiments performed on a collection of 70 political debates, comprising more than 43 hours of live edited recordings, showed that the automatically extracted Audio-Visual clusters closely match the ground-truth annotation, achieving high purity.
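
The co-occurrence ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which significance test the paper uses, so the binomial z-score below is an assumption, as are the frame-level label inputs, the function name `rank_cooccurrences`, and the `z_threshold` parameter. It compares how often each (audio cluster, shot cluster) pair is observed with how often it would be expected if the two segmentations were independent.

```python
from collections import Counter
from math import sqrt


def rank_cooccurrences(audio_labels, shot_labels, z_threshold=2.0):
    """Rank (audio cluster, shot cluster) pairs by how far they exceed chance.

    Assumption: both inputs are per-frame label sequences of equal length,
    e.g. produced by a diarization system and a shot clustering system.
    The z-score test is illustrative, not the paper's actual criterion.
    """
    if len(audio_labels) != len(shot_labels):
        raise ValueError("label sequences must have the same length")
    n = len(audio_labels)
    if n == 0:
        return []

    joint = Counter(zip(audio_labels, shot_labels))
    audio_counts = Counter(audio_labels)
    shot_counts = Counter(shot_labels)

    ranked = []
    for (a, s), observed in joint.items():
        # Probability of the pair under independence of the two segmentations.
        p = (audio_counts[a] / n) * (shot_counts[s] / n)
        expected = n * p
        std = sqrt(n * p * (1.0 - p))
        if std == 0.0:
            continue  # degenerate case: only one audio and one shot cluster
        z = (observed - expected) / std
        if z > z_threshold:
            ranked.append((a, s, observed, z))

    # Strongest associations first.
    ranked.sort(key=lambda item: item[3], reverse=True)
    return ranked


if __name__ == "__main__":
    # Toy example: each speaker is always shown by the same camera view.
    audio = ["spk1"] * 50 + ["spk2"] * 50
    shots = ["camA"] * 50 + ["camB"] * 50
    for a, s, count, z in rank_cooccurrences(audio, shots):
        print(a, s, count, round(z, 2))
```

In this toy input each voice always coincides with one camera view, so both pairs are kept; if the voices alternated independently of the shots, every pair would sit at its expected count and nothing would be selected.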

Keywords

audio-visual systems; pattern clustering; audio visual clustering; edited recording; multimodal cluster; unsupervised detection; Cameras; Gold; Hidden Markov models; Irrigation; Manuals; Measurement; Visualization;

fLanguage

English

Publisher

IEEE

Conference_Titel

Multimedia Signal Processing (MMSP), 2010 IEEE International Workshop on

Conference_Location

Saint Malo

Print_ISBN

978-1-4244-8110-1

Electronic_ISBN

978-1-4244-8111-8

Type

conf

DOI

10.1109/MMSP.2010.5662015

Filename

5662015