• DocumentCode
    1761176
  • Title

    Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model

  • Author

    Deena, Salil ; Hou, Shaobo ; Galata, A.

  • Author_Institution
    Sch. of Comput. Sci., Univ. of Manchester, Manchester, UK
  • Volume
    15
  • Issue
    8
  • fYear
    2013
  • fDate
    Dec. 2013
  • Firstpage
    1755
  • Lastpage
    1768
  • Abstract
    In this paper, we present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Two talking head corpora are processed by extracting visual and audio data from the sequences followed by a parameterization of both data streams. Phonetic labels are obtained by performing forced phonetic alignment on the audio. The switching states are found using a variable length Markov model trained on the labelled phonetic data. The audio and visual data corresponding to phonemes matching each switching state are extracted and modelled together using a shared Gaussian process dynamical model. We propose a synthesis method that takes into account both previous and future phonetic context, thus accounting for forward and backward coarticulation in speech. Both objective and subjective evaluation results are presented. The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.
  • Keywords
    Gaussian processes; Markov processes; computer animation; speech synthesis; visual perception; audio data extraction; backward coarticulation; data streams; forced phonetic alignment; forward coarticulation; head corpora; intelligibility; labelled phonetic data; nonparametric switching state space model; phonemes matching; phonetic context; phonetic labels; shared Gaussian process dynamical model; speech-driven facial animation; state-of-the-art methods; synthetic videos; variable length Markov model; variable-order switching; visual data extraction; visual perception; visual speech synthesis; Animation; Data models; Hidden Markov models; Speech; Speech synthesis; Switches; Visualization; Artificial Talking Head; speech-driven facial animation; visual speech synthesis;
  • fLanguage
    English
  • Journal_Title
    Multimedia, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1520-9210
  • Type

    jour

  • DOI
    10.1109/TMM.2013.2279659
  • Filename
    6585824