• DocumentCode
    989974
  • Title
    Audiovisual Probabilistic Tracking of Multiple Speakers in Meetings
  • Author
    Gatica-Perez, Daniel ; Lathoud, Guillaume ; Odobez, Jean-Marc ; McCowan, Iain
  • Author_Institution
    IDIAP Res. Inst., Ecole Polytechnique Federale de Lausanne, Martigny
  • Volume
    15
  • Issue
    2
  • fYear
    2007
  • Firstpage
    601
  • Lastpage
    616
  • Abstract
    Tracking speakers in multiparty conversations constitutes a fundamental task for automatic meeting analysis. In this paper, we present a novel probabilistic approach to jointly track the location and speaking activity of multiple speakers in a multisensor meeting room, equipped with a small microphone array and multiple uncalibrated cameras. Our framework is based on a mixed-state dynamic graphical model defined on a multiperson state-space, which includes the explicit definition of a proximity-based interaction model. The model integrates audiovisual (AV) data through a novel observation model. Audio observations are derived from a source localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. Approximate inference in our model, needed given its complexity, is performed with a Markov Chain Monte Carlo particle filter (MCMC-PF), which results in high sampling efficiency. We present results, based on an objective evaluation procedure, that show that our framework 1) is capable of locating and tracking the position and speaking activity of multiple meeting participants engaged in real conversations with good accuracy, 2) can deal with cases of visual clutter and occlusion, and 3) significantly outperforms a traditional sampling-based approach.
  • Keywords
    Markov processes; Monte Carlo methods; architectural acoustics; audio acoustics; audio signal processing; audio visual systems; face recognition; microphone arrays; particle filtering (numerical methods); speaker recognition; Markov chain Monte Carlo particle filter; audiovisual probabilistic tracking; automatic meeting analysis; face detection; microphone array; mixed-state dynamic graphical model; multiparty conversations; multiperson state-space; multiple speakers; multisensor meeting room; proximity-based interaction model; sampling-based approach; source localization algorithm; visual clutter; Cameras; Graphical models; Humans; Microphone arrays; Object recognition; Predictive models; Sampling methods; Speech processing; Speech recognition; Streaming media; Meetings; Monte Carlo methods; multimodal fusion; tracking;
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1558-7916
  • Type
    jour
  • DOI
    10.1109/TASL.2006.881678
  • Filename
    4067033