Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams

Author

Silovsky, Jan ; Zdansky, Jindrich ; Nouza, Jan ; Cerva, Petr ; Prazak, Jan

Author_Institution

Inst. of Inf. Technol. & Electron., Tech. Univ. of Liberec, Liberec, Czech Republic

fYear

2012

fDate

17-19 Sept. 2012

Firstpage

118

Lastpage

123

Abstract

In this paper we study the effect of incorporating automatic transcriptions into the speaker diarization process. We aim to improve both the diarization accuracy, as evaluated by standard objective measures, and the quality of the diarization output from the user's perspective. Although the presented approach relies on the output of an automatic speech recognizer, it makes no use of lexical information. Instead, we use information about word boundaries and the classification of non-speech events occurring in the processed stream. The former is used as a constraint on speaker change-point candidates; the latter makes it possible to discard various vocal noise sounds that carry no speaker-specific information (given the representation of the signal by cepstral features) and thus harm the speaker's representation. The experimental evaluation of the presented approach was carried out using the COST278 multilingual broadcast news database. We demonstrate that the approach improves both speaker diarization and segmentation performance measures. Furthermore, we show that the number of change-points detected within words (rather than at their boundaries) is significantly reduced.
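
The two uses of ASR output named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the constraint is applied, so the sketch assumes the simplest reading, in which a raw change-point is moved to the nearest token boundary and frames inside non-speech events are masked out before speaker modelling. The Token structure, the function names, and the 10 ms frame shift are all hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Token:
    """One ASR output item (hypothetical structure, not taken from the paper)."""
    start: float    # start time in seconds
    end: float      # end time in seconds
    is_noise: bool  # True for a non-speech event such as a breath or a cough


def boundary_candidates(tokens):
    """Return the sorted, de-duplicated start and end times of all tokens."""
    return sorted({t for tok in tokens for t in (tok.start, tok.end)})


def snap_to_boundary(change_point, boundaries):
    """Move a raw change-point to the nearest allowed boundary.

    Assumes `boundaries` is sorted and non-empty.
    """
    i = bisect_left(boundaries, change_point)
    # Only the boundary just before and just after can be the nearest one.
    neighbours = boundaries[max(i - 1, 0):i + 1]
    return min(neighbours, key=lambda b: abs(b - change_point))


def speech_frame_mask(tokens, n_frames, frame_shift=0.01):
    """Return a per-frame keep/drop mask; frames inside noise events are False.

    The 10 ms frame shift is an assumption, not a value from the paper.
    """
    mask = [True] * n_frames
    for tok in tokens:
        if tok.is_noise:
            first = int(round(tok.start / frame_shift))
            last = min(int(round(tok.end / frame_shift)), n_frames)
            for k in range(first, last):
                mask[k] = False
    return mask


tokens = [
    Token(0.0, 0.5, False),  # a word
    Token(0.5, 0.8, True),   # a non-speech event
    Token(0.8, 1.4, False),  # a word
]
boundaries = boundary_candidates(tokens)
snapped = snap_to_boundary(0.62, boundaries)
mask = speech_frame_mask(tokens, n_frames=140)
```

Here a change-point hypothesised at 0.62 s, which falls inside a token, is moved to the boundary at 0.5 s, and the frames covering 0.5 s to 0.8 s are excluded from the speaker representation. A system could equally evaluate its change-point criterion only at the boundary times instead of snapping afterwards; the abstract does not settle which variant was used.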

Keywords

broadcasting; database management systems; information resources; natural language processing; pattern clustering; speaker recognition; ASR output; COST278 multilingual broadcast news database; automatic speech recognizer; broadcast streams; diarization accuracy; diarization output quality; nonspeech event classification; segmentation performance measures; speaker change-point candidates; speaker clustering; speaker diarization process; speaker segmentation; standard objective measures; vocal noise sounds; word boundaries; Covariance matrix; Databases; Smoothing methods; Speech; Speech recognition; Standards; Vectors

fLanguage

English

Publisher

IEEE

Conference_Titel

Multimedia Signal Processing (MMSP), 2012 IEEE 14th International Workshop on

Conference_Location

Banff, AB

Print_ISBN

978-1-4673-4570-5

Electronic_ISBN

978-1-4673-4571-2

Type

conf

DOI

10.1109/MMSP.2012.6343426

Filename

6343426