DocumentCode
642897
Title
Developing text and speech databases for speech recognition of Vietnamese
Author
Nguyen Thien Chuong ; Chaloupka, J.
Author_Institution
Inst. of Inf. Technol. & Electron., Tech. Univ. of Liberec, Liberec, Czech Republic
Volume
01
fYear
2013
fDate
12-14 Sept. 2013
Firstpage
163
Lastpage
166
Abstract
This paper describes our study on developing text and speech databases for automatic speech recognition of Vietnamese using an available source of linguistic data: the Internet. First, a two-stage procedure is applied to extract a general text corpus that can be used for research on the Vietnamese language, such as speech recognition, audio-visual speech recognition, and natural language processing... We also collect a second, domain-specific text corpus covering news and literature from several major Vietnamese Web sites. The combined text corpus, containing 8,681,869 sentences and more than 124 million syllables, is then used to build and test the language model for the speech recognizer. The collection of speech corpora for experiments on continuous speech recognition and audio-visual speech recognition of Vietnamese is also described.
Keywords
Web sites; audio-visual systems; natural language processing; query processing; speech recognition; text analysis; Vietnamese Web sites; Vietnamese language; audio-visual speech recognition; automatic speech recognition; linguistic data source; speech database development; text corpus extraction; text database development; text sentences; text syllables; two-stage procedure; Electronic publishing; Encyclopedias; Internet; Speech; Vocabulary; speech corpus; text corpus; tonal language
fLanguage
English
Publisher
IEEE
Conference_Titel
Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2013 IEEE 7th International Conference on
Conference_Location
Berlin
Print_ISBN
978-1-4799-1426-5
Type
conf
DOI
10.1109/IDAACS.2013.6662662
Filename
6662662