Title :
Speech-driven face synthesis from 3D video
Author :
Ypsilos, Ioannis A. ; Hilton, Adrian ; Turkmani, Aseel ; Jackson, Philip J B
Author_Institution :
Centre for Vision, Speech & Signal Processing, University of Surrey, Guildford, UK
Abstract :
We present a framework for speech-driven synthesis of real faces from a corpus of 3D video of a person speaking. Video-rate capture of dynamic 3D face shape and colour appearance provides the basis for a visual speech synthesis model. A displacement map representation combines face shape and colour into a 3D video. This representation is used to efficiently register and integrate shape and colour information captured from multiple views. To allow visual speech synthesis, viseme primitives are identified from the corpus using automatic speech recognition. A novel nonrigid alignment algorithm is introduced to estimate dense correspondence between 3D face shape and appearance for different visemes. The registered displacement map representation, together with a novel optical flow optimisation using both shape and colour, enables accurate and efficient nonrigid alignment. Face synthesis from speech is performed by concatenation of the corresponding viseme sequence, using the nonrigid correspondence to reproduce both 3D face shape and colour appearance. Concatenative synthesis reproduces both viseme timing and co-articulation. Face capture and synthesis have been performed for a database of 51 people. Results demonstrate synthesis of 3D visual speech animation with a quality comparable to the captured video of a person.
Keywords :
computer animation; face recognition; image colour analysis; image representation; image sequences; speech recognition; speech synthesis; video recording; visual databases; 3D face colour appearance; 3D video; 3D visual speech animation; automatic speech recognition; co-articulation; concatenative synthesis; displacement map representation; dynamic 3D face shape; face databases; nonrigid alignment algorithm; optical flow optimisation; speech-driven face synthesis; video-rate face capture; viseme sequence; viseme timing; visual speech synthesis model; Face detection; Facial animation; Image motion analysis; Image reconstruction; Optical films; Optical sensors; Production; Shape; Signal synthesis; Speech synthesis
Conference_Title :
3D Data Processing, Visualization and Transmission, 2004. 3DPVT 2004. Proceedings. 2nd International Symposium on
Print_ISBN :
0-7695-2223-8
DOI :
10.1109/TDPVT.2004.1335143