DocumentCode
3744963
Title
Talking heads synthesis from audio with deep neural networks
Author
Taiki Shimba;Ryuhei Sakurai;Hirotake Yamazoe;Joo-Ho Lee
Author_Institution
Graduate School of Information Science and Engineering, Ritsumeikan University, 1-1-1 Kusatsu, Shiga, Japan
fYear
2015
Firstpage
100
Lastpage
105
Abstract
This paper proposes talking heads synthesis with expressions from speech. Talking heads synthesis can be regarded as a sequence-to-sequence learning problem that maps audio input to video output. To synthesize talking heads, we use the SAVEE database, which consists of frontal-face videos of multiple spoken sentences. Audiovisual data can be regarded as two parallel sequences of continuous-valued audio and visual features; thus, the audio and visual features of our dataset are represented by a regression model. In this research, the regression model is trained with a long short-term memory (LSTM) network by minimizing the mean squared error (MSE), with audio features as the input and visual features as the target of the LSTM. Thereby, talking heads are synthesized from speech. Our method uses lower-level audio features than phonemes, which enables it to synthesize talking heads with expressions, whereas existing studies that use phonemes as audio features can synthesize only neutral-expression talking heads. With the SAVEE database, we achieved a minimum MSE of 17.03 on our testing dataset. In the experiments, we use mel-frequency cepstral coefficients (MFCC), ΔMFCC, and Δ²MFCC with energy as the audio features, and an active appearance model (AAM) over the entire face region as the visual features.
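The abstract's core idea, an LSTM regressing per-frame audio features onto per-frame visual (AAM) parameters under an MSE objective, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions (39-dim audio, 20-dim AAM), the single-layer LSTM cell, and the random dummy sequences are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes (not taken from the paper): a 39-dim audio
# vector per frame (e.g. 13 MFCC + 13 delta + 13 delta-delta, with energy),
# and a 20-dim AAM parameter vector per video frame, over T frames.
AUDIO_DIM, HIDDEN_DIM, AAM_DIM, T = 39, 64, 20, 50

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialised LSTM weights: one combined matrix for the
# input, forget, cell, and output gates over [x_t; h_{t-1}].
W = rng.normal(0, 0.1, (4 * HIDDEN_DIM, AUDIO_DIM + HIDDEN_DIM))
b = np.zeros(4 * HIDDEN_DIM)
# Linear readout from the hidden state to AAM parameters.
W_out = rng.normal(0, 0.1, (AAM_DIM, HIDDEN_DIM))

def lstm_forward(audio_seq):
    """Map a (T, AUDIO_DIM) audio sequence to a (T, AAM_DIM) visual sequence."""
    h = np.zeros(HIDDEN_DIM)
    c = np.zeros(HIDDEN_DIM)
    outputs = []
    for x_t in audio_seq:
        z = W @ np.concatenate([x_t, h]) + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell state update
        h = sigmoid(o) * np.tanh(c)                   # hidden state
        outputs.append(W_out @ h)                     # predicted AAM params
    return np.stack(outputs)

# Dummy parallel sequences standing in for one SAVEE utterance.
audio = rng.normal(size=(T, AUDIO_DIM))
target_aam = rng.normal(size=(T, AAM_DIM))

pred = lstm_forward(audio)
mse = np.mean((pred - target_aam) ** 2)  # the MSE training objective
print(f"MSE: {mse:.4f}")
```

Training would then backpropagate this MSE through time to update `W`, `b`, and `W_out`; in practice one would use an autodiff framework rather than hand-written gradients.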
Keywords
"Active appearance model","Speech","Visualization","Mel frequency cepstral coefficient","Face","Shape","Feature extraction"
Publisher
ieee
Conference_Titel
System Integration (SII), 2015 IEEE/SICE International Symposium on
Type
conf
DOI
10.1109/SII.2015.7404961
Filename
7404961
Link To Document