DocumentCode
3744963
Title
Talking heads synthesis from audio with deep neural networks
Author
Taiki Shimba;Ryuhei Sakurai;Hirotake Yamazoe;Joo-Ho Lee
Author_Institution
Graduate School of Information Science and Engineering, Ritsumeikan University, 1-1-1 Kusatsu, Shiga, Japan
fYear
2015
Firstpage
100
Lastpage
105
Abstract
This paper proposes talking heads synthesis with expressions from speech. Talking heads synthesis can be regarded as a sequence-to-sequence learning problem that maps audio input to video output. To synthesize talking heads, we use the SAVEE database, which consists of frontal-face videos of multiple spoken sentences. Audiovisual data can be regarded as two parallel sequences of continuous-valued audio and visual features; thus, the audio and visual features of our dataset are represented by a regression model. In this research, the regression model is trained with a long short-term memory (LSTM) network by minimizing the mean squared error (MSE), with audio features as the input and visual features as the target of the LSTM. Thereby, talking heads are synthesized from speech. Our method uses lower-level audio features than phonemes, which enables it to synthesize talking heads with expressions, whereas existing studies that use phonemes as audio features can synthesize only neutral-expression talking heads. With the SAVEE database, we achieved a minimum MSE of 17.03 on our testing dataset. In the experiments, we use mel-frequency cepstral coefficients (MFCC), ΔMFCC, and Δ²MFCC with energy as the audio features, and an active appearance model (AAM) over the entire face region as the visual features.
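The abstract's core idea, an LSTM regressing per-frame audio features onto per-frame visual (AAM) parameters under an MSE objective, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions (39-dim audio, 20-dim AAM), the single-layer LSTM cell, and the random dummy sequences are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (not taken from the paper): a 39-dim audio
# vector per frame (e.g. 13 MFCC + 13 delta + 13 delta-delta, with energy),
# and a 20-dim AAM parameter vector per video frame, over T frames.
AUDIO_DIM, HIDDEN_DIM, AAM_DIM, T = 39, 64, 20, 50

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialised LSTM weights: one combined matrix for the
# input, forget, cell, and output gates over [x_t; h_{t-1}].
W = rng.normal(0, 0.1, (4 * HIDDEN_DIM, AUDIO_DIM + HIDDEN_DIM))
b = np.zeros(4 * HIDDEN_DIM)
# Linear readout from the hidden state to AAM parameters.
W_out = rng.normal(0, 0.1, (AAM_DIM, HIDDEN_DIM))

def lstm_forward(audio_seq):
    """Map a (T, AUDIO_DIM) audio sequence to a (T, AAM_DIM) visual sequence."""
    h = np.zeros(HIDDEN_DIM)
    c = np.zeros(HIDDEN_DIM)
    outputs = []
    for x_t in audio_seq:
        z = W @ np.concatenate([x_t, h]) + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell state update
        h = sigmoid(o) * np.tanh(c)                   # hidden state
        outputs.append(W_out @ h)                     # predicted AAM params
    return np.stack(outputs)

# Dummy parallel sequences standing in for one SAVEE utterance.
audio = rng.normal(size=(T, AUDIO_DIM))
target_aam = rng.normal(size=(T, AAM_DIM))

pred = lstm_forward(audio)
mse = np.mean((pred - target_aam) ** 2)  # the MSE training objective
print(f"MSE: {mse:.4f}")
```

Training would then backpropagate this MSE through time to update `W`, `b`, and `W_out`; in practice one would use an autodiff framework rather than hand-written gradients.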
Keywords
"Active appearance model","Speech","Visualization","Mel frequency cepstral coefficient","Face","Shape","Feature extraction"
Publisher
ieee
Conference_Titel
System Integration (SII), 2015 IEEE/SICE International Symposium on
Type
conf
DOI
10.1109/SII.2015.7404961
Filename
7404961
Link To Document