Automatic online text selection for constructing text corpus with custom phonetic distribution

Author

Vorapatratorn, Surapol ; Suchato, Atiwong ; Punyabukkana, Proadpran

Author_Institution

Dept. of Comput. Eng., Chulalongkorn Univ., Bangkok, Thailand

fYear

2012

fDate

May 30 2012-June 1 2012

Firstpage

6

Lastpage

11

Abstract

The performance of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems depends on an appropriate text corpus. This article describes an automated method for generating a text corpus with a custom phonetic distribution. The custom distribution is specified by the phoneme types, the corpus size, the minimum required number of phonemes, and the target phonetic distribution. The system collects text from the Internet by continuously downloading it with a web crawler. A greedy algorithm then extracts the sentences that best fit the target phonetic distribution, until an appropriate text corpus is established. In the experiment, text from the Large Vocabulary Continuous Speech Recognition (LVCSR) corpus for the Thai language [1] is used to generate the target phonetic distribution. The results show that, as more data is drawn from the Internet, the selected corpus reaches the target phonetic distribution and achieves 99.13% diphone coverage. The resulting text corpus can then be used to build a speech corpus efficiently.
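
The greedy selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the fitting criterion, so the sum of squared differences used here is an assumption, as are the names `distribution`, `distance`, and `greedy_select` and the input format of (sentence, phoneme list) pairs. The minimum phoneme count criterion and the web crawler are left out.

```python
from collections import Counter


def distribution(counts):
    """Normalize raw phoneme counts into relative frequencies."""
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def distance(counts, target):
    """Sum of squared differences between the corpus distribution and the
    target. This metric is an assumption, not taken from the paper."""
    dist = distribution(counts)
    phonemes = set(dist) | set(target)
    return sum((dist.get(p, 0.0) - target.get(p, 0.0)) ** 2 for p in phonemes)


def greedy_select(candidates, target, corpus_size):
    """Pick up to corpus_size sentences. At each step, take the candidate
    whose phonemes bring the running distribution closest to the target.

    candidates: list of (sentence_text, list_of_phonemes) pairs
                (hypothetical format; phonemes are assumed pre-transcribed).
    target:     dict mapping phoneme -> desired relative frequency.
    """
    selected = []
    counts = Counter()
    remaining = list(candidates)
    while remaining and len(selected) < corpus_size:
        best = min(
            remaining,
            key=lambda s: distance(counts + Counter(s[1]), target),
        )
        selected.append(best[0])
        counts.update(best[1])
        remaining.remove(best)
    return selected, counts


if __name__ == "__main__":
    # Toy data with two phonemes; real input would come from crawled text.
    toy = [("s1", ["a", "a"]), ("s2", ["a", "b"]), ("s3", ["b", "b"])]
    chosen, totals = greedy_select(toy, {"a": 0.5, "b": 0.5}, corpus_size=3)
    print(chosen, dict(totals))
```

Each step rescans all remaining candidates, so the sketch costs on the order of the number of candidates times the corpus size. That is acceptable for illustration, but a crawler-fed pipeline would likely need incremental scoring.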

Keywords

Internet; greedy algorithms; information retrieval; natural languages; speech recognition; text analysis; ASR; LVCSR corpus; TTS; Thai language; Web crawler; automated text corpus generation method; automatic online text selection; automatic speech recognition; corpus size; custom phonetic distribution; data downloading; diphone coverage generation; greedy algorithm; large-vocabulary continuous speech recognition corpus; phoneme minimum criterion number; phoneme types; proper sentence extraction; speech corpus generation; target phonetic distribution; text-to-speech systems; Databases; Equations; Mathematical model; Speech; Vocabulary; online corpus; phonetic; phonetically balanced; sentence segmentation; text selection

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Science and Software Engineering (JCSSE), 2012 International Joint Conference on

Conference_Location

Bangkok

Print_ISBN

978-1-4673-1920-1

Type

conf

DOI

10.1109/JCSSE.2012.6261916

Filename

6261916