Regional Information Center for Science and Technology - A Multimodal Approach to Speaker Diarization on TV Talk-Shows

DocumentCode :

1757617

Title :

A Multimodal Approach to Speaker Diarization on TV Talk-Shows

Author :

Vallet, Felicien ; Essid, Slim ; Carrive, Jean

Author_Institution :

Res. Dept., Inst. Nat. de l'Audiovisuel, Bry-sur-Marne, France

Volume :

Issue :

fYear :

2013

fDate :

April 2013

Firstpage :

509

Lastpage :

520

Abstract :

In this article, we propose solutions to the problem of speaker diarization of TV talk-shows, a problem for which adapted multimodal approaches, relying on data streams other than audio alone, remain largely underexploited. We therefore propose an original system that leverages prior knowledge of the structure of this type of content, especially the visual information relating to the active speakers, to improve diarization performance. The architecture of this system can be decomposed into two main stages. First, a reliable training set is created, in an unsupervised fashion, for each participant of the TV program being processed. This data is assembled by associating visual and audio descriptors carefully selected in a clustering cascade. Then, Support Vector Machines are used to classify the speech data of the given TV program. The performance of this new architecture is assessed on two French talk-show collections: Le Grand Échiquier and On n'a pas tout dit. The results show that our new system outperforms state-of-the-art methods, demonstrating the effectiveness of kernel-based methods, as well as visual cues, in multimodal approaches to speaker diarization of challenging content such as TV talk-shows.
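The two-stage architecture summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`select_reliable`, `train_linear_svm`, `diarize`), the toy 2-D features, and the precomputed cluster labels are all assumptions made for illustration. The clustering cascade is reduced to a simple audio/visual agreement check, and the kernel SVMs are swapped for a linear SVM trained by subgradient descent on the hinge loss, so that the example needs only the Python standard library.

```python
from collections import Counter, defaultdict


def select_reliable(audio_clusters, visual_ids):
    """Stage 1 (toy version): keep segments where audio and visual cues agree.

    Each audio cluster is mapped to the on-screen identity it co-occurs with
    most often; only segments matching that mapping are kept as training data.
    Segments with no visible speaker (visual id None) are never kept.
    """
    votes = defaultdict(Counter)
    for a, v in zip(audio_clusters, visual_ids):
        if v is not None:
            votes[a][v] += 1
    best = {a: c.most_common(1)[0][0] for a, c in votes.items()}
    return [i for i, (a, v) in enumerate(zip(audio_clusters, visual_ids))
            if v is not None and best.get(a) == v]


def train_linear_svm(X, y, epochs=200, eta=0.01, lam=0.01):
    """Binary linear SVM: subgradient descent on the L2-regularized hinge loss.

    y holds +1/-1 targets. Samples are visited in a fixed order so the
    result is deterministic. (Assumption: linear, not kernel-based.)
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active: step toward the sample
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def diarize(features, audio_clusters, visual_ids):
    """Stage 2: train one-vs-rest SVMs on reliable segments, label all segments."""
    keep = select_reliable(audio_clusters, visual_ids)
    X = [features[i] for i in keep]
    speakers = sorted({visual_ids[i] for i in keep})
    models = {}
    for s in speakers:
        y = [1 if visual_ids[i] == s else -1 for i in keep]
        models[s] = train_linear_svm(X, y)

    def score(s, x):
        w, b = models[s]
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    # Assign each segment to the speaker whose classifier scores it highest.
    return [max(speakers, key=lambda s: score(s, x)) for x in features]


# Hypothetical toy data: one 2-D audio feature vector per speech segment.
features = [(0.0, 0.2), (0.3, 0.1), (0.1, 0.4), (5.0, 5.1),
            (4.8, 5.3), (5.2, 4.9), (0.2, 0.3), (5.1, 5.0)]
audio_clusters = [0, 0, 0, 1, 1, 1, 0, 1]          # from an audio clustering step
visual_ids = ["A", "A", None, "B", "B", None, None, None]  # None: speaker off screen
labels = diarize(features, audio_clusters, visual_ids)
print(labels)
```

In this sketch, only the four segments with a visible, consistent speaker are used for training; the remaining segments, where no speaker is on screen, are labeled by the trained classifiers. A real system would replace the toy features with acoustic descriptors and use kernel SVMs, as the abstract states.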

Keywords :

speaker recognition; support vector machines; French talk show collection; TV program; TV talk shows; audio descriptors; clustering cascade; diarization performance; kernel based method; multimodal approach; original system; reliable training set; speaker diarization; speech data; visual descriptors; Cameras; Databases; Microphones; NIST; Speech; TV; Visualization; Fusion; SVM classification; joint audiovisual processing; multimodality; talk-show; unsupervised learning

fLanguage :

English

Journal_Title :

IEEE Transactions on Multimedia

Publisher :

IEEE

ISSN :

1520-9210

Type :

jour

DOI :

10.1109/TMM.2012.2233724

Filename :

6380624

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1757617