Development of text-to-audiovisual speech synthesis to support interactive language learning on a mobile device

Author

Wai-Kim Leung ; Ka-Wa Yuen ; Ka-Ho Wong ; Meng, Hsiang-Yun

Author_Institution

Dept. of Syst. Eng. & Eng. Manage., Chinese Univ. of Hong Kong, Hong Kong, China

fYear

2013

fDate

2-5 Dec. 2013

Firstpage

583

Lastpage

588

Abstract

We have developed a distributed text-to-audiovisual-speech synthesizer (TTAVS) to support interactivity in computer-aided pronunciation training (CAPT) on a mobile platform. The TTAVS generates audiovisual corrective feedback based on mispronunciations detected in the second-language learner's speech. Our approach encodes key visemes in SVG format, compresses them with GZIP, and transmits them to the client, where the browser performs real-time morphing to render the visual speech. We have also developed a TTAVS animation player that plays the audio and visual speech synchronously and offers play/pause/resume user controls. Evaluation shows that, compared with our original approach of generating an Ogg video on the server and streaming it to the client, the newly proposed approach achieves a significant reduction (66%) in the average size of the output files transmitted from the server to the client and a reduction (83%) in client waiting times, while preserving image quality.
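
The pipeline described in the abstract (key visemes encoded as SVG, GZIP compression, client-side morphing) can be illustrated with a minimal sketch. This is not the authors' implementation: the mouth outlines, the function names `lerp_points`, `to_svg_path`, and `pack_visemes`, and the use of plain linear interpolation for morphing are all assumptions made for illustration. The paper's system performs the morphing in the browser; Python is used here only to show the idea.

```python
import gzip

def lerp_points(a, b, t):
    """Linearly interpolate between two equal-length lists of (x, y) points."""
    if len(a) != len(b):
        raise ValueError("key visemes must share the same number of control points")
    return [(ax + (bx - ax) * t, ay + (by - ay) * t)
            for (ax, ay), (bx, by) in zip(a, b)]

def to_svg_path(points):
    """Render a closed polygon as an SVG <path> element."""
    head, *tail = points
    d = "M{:.1f},{:.1f}".format(*head)
    d += "".join(" L{:.1f},{:.1f}".format(*p) for p in tail) + " Z"
    return '<path d="{}"/>'.format(d)

def pack_visemes(visemes):
    """Serialize key visemes into one SVG document and GZIP-compress it."""
    body = "".join(to_svg_path(v) for v in visemes)
    svg = '<svg xmlns="http://www.w3.org/2000/svg">' + body + "</svg>"
    return gzip.compress(svg.encode("utf-8"))

# Hypothetical mouth outlines (assumed data, not taken from the paper).
closed_mouth = [(10.0, 50.0), (50.0, 48.0), (90.0, 50.0), (50.0, 52.0)]
open_mouth = [(10.0, 50.0), (50.0, 30.0), (90.0, 50.0), (50.0, 70.0)]

# Server side: only the key visemes are sent, compressed.
payload = pack_visemes([closed_mouth, open_mouth])

# Client side: in-between frames are computed locally instead of streamed.
halfway = lerp_points(closed_mouth, open_mouth, 0.5)
```

Because only a handful of key shapes cross the network and every in-between frame is computed on the client, the transmitted payload stays small compared with a pre-rendered video, which is the trade-off the abstract reports.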

Keywords

audio streaming; audio-visual systems; client-server systems; computer animation; computer based training; interactive video; mobile computing; natural language processing; rendering (computer graphics); speech synthesis; video streaming; GZIP; Ogg video generation; Ogg video streaming; SVG format compression; TTAVS animation player; audio speech rendering; audiovisual corrective feedback; browser; client waiting time reduction; client-server system; computer aided pronunciation training; distributed TTAVS; image quality preservation; interactive language learning; mispronunciation detection; mispronunciation generation; mobile device; play-pause-resume; real-time morphing; second language learner speech; text-to-audiovisual speech synthesis; user control; vis-à-vis; visual speech rendering; Animation; Generators; Servers; Speech; Speech synthesis; Streaming media; Visualization; computer aided-pronunciation training system (CAPT); language learning; visual speech synthesizer;

fLanguage

English

Publisher

IEEE

Conference_Titel

Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on

Conference_Location

Budapest

Print_ISBN

978-1-4799-1543-9

Type

conf

DOI

10.1109/CogInfoCom.2013.6719170

Filename

6719170