Title :
Segmentation of TV shows into scenes using speaker diarization and speech recognition
Author_Institution :
Spoken Language Process. Group, LIMSI, Orsay, France
Abstract :
We investigate the use of speaker diarization (SD) and automatic speech recognition (ASR) for the segmentation of audiovisual documents into scenes. We introduce multiple monomodal and multimodal approaches based on the generalized scene transition graph (GSTG), a state-of-the-art algorithm. First, we extend GSTG with semantic information derived from both SD and ASR. Then, multimodal fusion of color histograms, SD and ASR is investigated at various points of the GSTG pipeline (early, late or intermediate fusion). Experiments conducted on a few episodes of a popular TV show indicate that SD and ASR can be successfully combined with visual information, yielding an 11% relative increase in F1-measure for scene boundary detection over the state-of-the-art baseline.
Keywords :
audio-visual systems; image colour analysis; image segmentation; speech recognition; video signal processing; F1-measure; TV show segmentation; audiovisual document segmentation; automatic speech recognition; color histograms; generalized scene transition graph; scene boundary detection; speaker diarization; visual information; Color; Histograms; Semantics; TV; Visualization; multimodal fusion; scene transition graph
Conference_Title :
Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on
Conference_Location :
Kyoto
Print_ISBN :
978-1-4673-0045-2
ISSN :
1520-6149
DOI :
10.1109/ICASSP.2012.6288393