• DocumentCode
    3744870
  • Title

    Multimodal embedding fusion for robust speaker role recognition in video broadcast

  • Author

    Michael Rouvier;Sebastien Delecraz;Benoit Favre;Meriem Bendris;Frederic Bechet

  • Author_Institution
    Aix-Marseille Université, CNRS, LIF, Marseille, France
  • fYear
    2015
  • Firstpage
    383
  • Lastpage
    389
  • Abstract
    Person role recognition in video broadcasts consists of classifying people into roles such as anchor, journalist, guest, etc. Existing approaches mostly consider one modality, either audio (speaker role recognition) or image (shot role recognition), firstly because of the non-synchrony between the two modalities, and secondly because of the lack of a video corpus annotated in both modalities. Deep Neural Network (DNN) approaches offer the ability to learn feature representations (embeddings) and classification functions simultaneously. This paper presents a multimodal fusion of audio, text and image embedding spaces for speaker role recognition in asynchronous data. Monomodal embeddings are trained on exogenous data and fine-tuned using a DNN on 70 hours of a French broadcast corpus for the target task. Experiments on the REPERE corpus show the benefit of embedding-level fusion compared to the monomodal embedding systems and to the standard late fusion method.
  • Keywords
    "Feature extraction","Image recognition","Visualization","Acoustics","Speech","Neural networks","Support vector machines"
  • Publisher
    IEEE
  • Conference_Titel
    Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on
  • Type

    conf

  • DOI
    10.1109/ASRU.2015.7404820
  • Filename
    7404820