The 2015 Sheffield system for longitudinal diarisation of broadcast media

Author

Rosanna Milner;Oscar Saz;Salil Deena;Mortaza Doulaty;Raymond W. M. Ng;Thomas Hain

Author_Institution

Speech and Hearing Research group, Department of Computer Science, University of Sheffield, UK

fYear

2015

Firstpage

632

Lastpage

638

Abstract

Speaker diarisation is the task of answering "who spoke when" within a multi-speaker audio recording. Diarisation of broadcast media typically operates on individual television shows, and is a particularly difficult task, due to a high number of speakers and challenging background conditions. Using prior knowledge, such as that from previous shows in a series, can improve performance. Longitudinal diarisation makes it possible to use knowledge from previous audio files to improve performance, but requires finding matching speakers across consecutive files. This paper describes the University of Sheffield system for participation in the 2015 Multi-Genre Broadcast (MGB) challenge. The challenge required longitudinal diarisation of data from BBC archives, under very constrained resource settings. Our system consists of three main stages: speech activity detection using DNNs with novel adaptation and decoding methods; speaker segmentation and clustering, with adaptation of the DNN-based clustering models; and finally speaker linking to match speakers across shows. The final result on the development set of 19 shows from five different television series was a Diarisation Error Rate of 50.77% in the diarisation and linking task.

Keywords

"Speech","Joining processes","Training","Adaptation models","Decoding","Hidden Markov models","Density estimation robust algorithm"

Publisher

IEEE

Conference_Titel

Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on

Type

conf

DOI

10.1109/ASRU.2015.7404855

Filename

7404855