Segmentation of TV shows into scenes using speaker diarization and speech recognition

Author

Bredin, Hervé

Author_Institution

Spoken Language Process. Group, LIMSI, Orsay, France

fYear

2012

fDate

25-30 March 2012

Firstpage

2377

Lastpage

2380

Abstract

We investigate the use of speaker diarization (SD) and automatic speech recognition (ASR) for the segmentation of audiovisual documents into scenes. We introduce multiple monomodal and multimodal approaches based on a state-of-the-art algorithm called generalized scene transition graph (GSTG). First, we extend GSTG with semantic information derived from both SD and ASR. Then, multimodal fusion of color histograms, SD and ASR is investigated at various points of the GSTG pipeline (early, late or intermediate fusion). Experiments conducted on a few episodes of a popular TV show indicate that SD and ASR can be successfully combined with visual information, bringing an additional 11% relative increase in F₁-measure for scene boundary detection over the state-of-the-art baseline.

Keywords

audio-visual systems; image colour analysis; image segmentation; speech recognition; video signal processing; F₁-measure; TV show segmentation; audiovisual document segmentation; automatic speech recognition; color histograms; generalized scene transition graph; scene boundary detection; speaker diarization; visual information; Color; Histograms; Semantics; TV; Visualization; multimodal fusion; scene transition graph

fLanguage

English

Publisher

IEEE

Conference_Titel

Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on

Conference_Location

Kyoto

ISSN

1520-6149

Print_ISBN

978-1-4673-0045-2

Electronic_ISBN

1520-6149

Type

conf

DOI

10.1109/ICASSP.2012.6288393

Filename

6288393