Initialization of Iterative-Based Speaker Diarization Systems for Telephone Conversations

Author

Ben-Harush, Oshry ; Ben-Harush, Ortal ; Lapidot, Itshak ; Guterman, Hugo

Author_Institution

Dept. of Electr. & Comput. Eng., Ben-Gurion Univ. of the Negev, Beer-Sheva, Israel

Volume

Issue

fYear

2012

Firstpage

414

Lastpage

425

Abstract

Speaker diarization systems attempt to assign temporal segments from a conversation between R speakers to an appropriate speaker r. This task is generally performed when no prior information is given regarding the speakers. The number of speakers is usually unknown and needs to be estimated. However, there are applications where the number of speakers is known in advance. The diarization process generally consists of change detection, clustering and labeling of a given audio stream. Speaker diarization can be performed using an iterative approach that is optimized by the selection of appropriate initial conditions. This study examines the influence of several common initialization algorithms, including two variants of a recently proposed K-means based initialization algorithm, on the performance of an iterative-based speaker diarization system applied to two-speaker telephone conversations. The suggested speaker diarization system employs either self-organizing maps or Gaussian mixture models to model the speakers and non-speech in the conversation. The diarization system and initialization algorithms are tuned using a development set of 108 telephone conversations taken from the LDC CallHome corpus. The evaluation subset is composed of 2048 telephone conversations extracted from the NIST 2005 Rich Transcription corpus. The results show that initializing the speaker diarization system with the K-means based algorithms provides a relative improvement of 10.4% for the LDC development set and 12.2% for the NIST evaluation subset compared to random initialization after 12 iterations, the number required for the diarization process to converge under random initialization. However, when using the K-means based initialization approach, only five iterations are required for the system to converge. Thus, the new initialization improves performance in terms of both diarization error rate and speed of convergence.

Keywords

Gaussian processes; iterative methods; speech processing; telephone sets; Gaussian mixture model; K-means based algorithm; K-means based initialization algorithm; LDC CallHome corpus; LDC development set; NIST 2005 rich transcription corpus; NIST evaluation subset; iterative-based speaker diarization system; self organizing map; speaker telephone conversation; temporal segment assignment; Clustering algorithms; Feature extraction; Hidden Markov models; Neurons; Speech; Training; Vectors; Gaussian mixture model (GMM); hidden-Markov model (HMM); initialization; self-organizing map; speaker diarization;

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TASL.2011.2161079

Filename

6136473

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1427439