• DocumentCode
    739502
  • Title

    Multimodal Multi-Channel On-Line Speaker Diarization Using Sensor Fusion Through SVM

  • Author

    Peruffo Minotto, Vicente ; Rosito Jung, Claudio ; Lee, Bowon

  • Author_Institution
    Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
  • Volume
    17
  • Issue
    10
  • fYear
    2015
  • Firstpage
    1694
  • Lastpage
    1705
  • Abstract
    Speaker diarization (SD) is the process of assigning speech segments of an audio stream to their corresponding speakers, thus comprising the problems of voice activity detection (VAD), speaker labeling/identification, and often sound source localization (SSL). Most research activities in the past aimed at applications such as broadcast news, meetings, conversational telephony, and automatic multimodal data annotation, where SD may be performed off-line. However, a recent research focus is human–computer interaction (HCI) systems where SD must be performed on-line and in real time, as in modern gaming devices and interaction with large displays. Often, such applications further suffer from noise, reverberation, and overlapping speech, making them increasingly challenging. In such situations, multimodal/multisensory approaches can provide more accurate results than unimodal ones, given that one data stream may compensate for occasional instabilities of other modalities. Accordingly, this paper presents an on-line multimodal SD algorithm designed to work in a realistic environment with multiple, overlapping speakers. Our work employs a microphone array, a color camera, and a depth sensor as input streams, from which speech-related features are extracted to be later merged through a support vector machine approach consisting of VAD and SSL modules. Speaker identification is incorporated through a hybrid technique of face positioning history and face recognition. Our final SD approach experimentally achieves an average diarization error rate of 11.48% in scenarios with up to three simultaneous speakers, and is able to run at 3.2× real-time.
  • Keywords
    Arrays; Feature extraction; Human computer interaction; Microphones; Robustness; Speech; Speech processing; Beamforming; SRP-PHAT; multimodal fusion; on-line speaker diarization; sound source localization; speaker labeling; voice activity detection
  • fLanguage
    English
  • Journal_Title
    Multimedia, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type

    jour

  • DOI
    10.1109/TMM.2015.2463722
  • Filename
    7175035