• DocumentCode
    3431200
  • Title
    Photo-real talking head with deep bidirectional LSTM
  • Author
    Bo Fan ; Lijuan Wang ; Frank K. Soong ; Lei Xie
  • Author_Institution
    Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China
  • fYear
    2015
  • fDate
    19-24 April 2015
  • Firstpage
    4884
  • Lastpage
    4888
  • Abstract
    Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose to use a deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject's talking is first recorded as our training data. The audio/visual stereo data are converted into two parallel temporal sequences, i.e., contextual label sequences obtained by force-aligning the audio against the text, and visual feature sequences obtained by applying an active appearance model (AAM) to the lower face region of all the training image samples. The deep BLSTM is then trained to learn the regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. After testing different network topologies, we found that the best network on our datasets is two BLSTM layers sitting on top of one feed-forward layer. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based one is better in both objective measurements and a subjective A/B test.
  • Keywords
    audio databases; audio signal processing; face recognition; feature extraction; feedforward neural nets; image sequences; recurrent neural nets; regression analysis; speech synthesis; stereo image processing; visual databases; AAM; SSE; active-appearance-model; audio database; audio modeling; audio stereo data; contextual label sequences; deep bidirectional LSTM; feed-forward layer; forced aligning audio; long short-term memory; parallel temporal sequences; photo-real talking head system; recurrent neural network architecture; regression model; sum-of-square error minimization; visual database; visual feature sequences; visual modeling; visual speech synthesis; visual stereo data; Active appearance model; Face; Hidden Markov models; Shape; Speech; Visualization; AAM; BLSTM; RNN; talking head;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on
  • Conference_Location
    South Brisbane, QLD
  • Type
    conf
  • DOI
    10.1109/ICASSP.2015.7178899
  • Filename
    7178899
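
  • Illustrative sketch (not part of the original record)
    The topology reported in the abstract (one feed-forward layer at the bottom, two BLSTM layers on top, trained by minimizing SSE between predicted and target visual features) can be sketched as a forward pass in plain Python. This is only a minimal sketch under stated assumptions: the layer sizes, the tanh feed-forward activation, the linear output layer, the random initialization, the peephole-free LSTM cell, and all names (`TalkingHeadNet`, `BLSTM`, `sse`, ...) are hypothetical and are not taken from the paper. Training (backpropagation through time) is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class LSTM:
    """One unidirectional LSTM layer (no peepholes, zero biases), forward pass only."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate (input, forget, output, candidate),
        # each acting on the concatenation [x_t ; h_{t-1}].
        self.W = {g: rand_matrix(n_hid, n_in + n_hid, rng) for g in "ifoc"}

    def run(self, xs):
        h = [0.0] * self.n_hid
        c = [0.0] * self.n_hid
        out = []
        for x in xs:
            z = x + h  # list concatenation: [x_t ; h_{t-1}]
            i = [sigmoid(v) for v in matvec(self.W["i"], z)]
            f = [sigmoid(v) for v in matvec(self.W["f"], z)]
            o = [sigmoid(v) for v in matvec(self.W["o"], z)]
            g = [math.tanh(v) for v in matvec(self.W["c"], z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            out.append(h)
        return out


class BLSTM:
    """Bidirectional layer: one LSTM reads left-to-right, one right-to-left."""

    def __init__(self, n_in, n_hid, rng):
        self.fwd = LSTM(n_in, n_hid, rng)
        self.bwd = LSTM(n_in, n_hid, rng)

    def run(self, xs):
        hf = self.fwd.run(xs)
        hb = self.bwd.run(xs[::-1])[::-1]  # re-align backward outputs with time
        return [a + b for a, b in zip(hf, hb)]  # concatenate per frame


class TalkingHeadNet:
    """Feed-forward layer -> BLSTM -> BLSTM -> linear output (assumed)."""

    def __init__(self, n_label, n_visual, n_hid=8, seed=0):
        rng = random.Random(seed)
        self.W_ff = rand_matrix(n_hid, n_label, rng)
        self.blstm1 = BLSTM(n_hid, n_hid, rng)
        self.blstm2 = BLSTM(2 * n_hid, n_hid, rng)
        self.W_out = rand_matrix(n_visual, 2 * n_hid, rng)

    def predict(self, labels):
        hs = [[math.tanh(v) for v in matvec(self.W_ff, x)] for x in labels]
        hs = self.blstm2.run(self.blstm1.run(hs))
        return [matvec(self.W_out, h) for h in hs]


def sse(pred, target):
    """Sum of squared errors over all frames and feature dimensions."""
    return sum((p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf))
```

    A call such as `TalkingHeadNet(n_label=4, n_visual=3).predict(labels)` maps a list of per-frame label vectors to one visual feature vector per frame. Because each BLSTM layer also reads the sequence backwards, the prediction for an early frame depends on later frames as well.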