DocumentCode :
3125588
Title :
Collecting sentences from web resources for constructing spontaneous Chinese language model
Author :
Xinhui Hu ; Youzheng Wu ; Matsuda, Shodai ; Hori, Chiori ; Kashioka, Hideki
Author_Institution :
Spoken Language Commun. Lab., Nat. Inst. of Inf. & Commun. Technol. (NICT), Kyoto, Japan
fYear :
2012
fDate :
5-8 Dec. 2012
Firstpage :
197
Lastpage :
200
Abstract :
In this paper, we present our work on collecting spontaneous texts from the Web for constructing a language model in a Chinese speech recognition system. The selection of spontaneous-like texts involves two steps. First, word-segmented web texts are selected using a perplexity-based approach in which style-related words are emphasized by omitting infrequent topic words from the similarity measurements. Second, the selected texts are clustered based on non-noun part-of-speech (POS) words, and optimal clusters are chosen by reference to a set of spontaneous seed sentences. Using a language model obtained by interpolating a model trained on the selected sentences with a baseline model, speech recognition evaluations were conducted on an open-domain spontaneous test set. The character error rate (CER) was reduced by 1.64% absolute (6.5% relative) compared with the baseline model. We also verified that the proposed method is superior to the conventional perplexity-based approach, with about a 1% absolute (4.0% relative) reduction in CER.
Keywords :
Internet; interpolation; natural language processing; pattern clustering; speech recognition; statistical analysis; text analysis; word processing; CER; Chinese speech recognition system; POS; Web resources; baseline model; character error rate; language model interpolation; nonnoun part-of-speech word; open domain spontaneous test set; optimal clusters; perplexity-based approach; sentence collection; similarity measurement; speech recognition evaluations; spontaneous Chinese language model construction; spontaneous seed sentences; spontaneous text collection; spontaneous-like text selection; style-related words; text clustering; word-segmented Web text; Adaptation models; Data models; Error analysis; Speech; Speech recognition; Training; Vocabulary; Text collection; language model; spontaneous speech recognition; web data;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on
Conference_Location :
Kowloon
Print_ISBN :
978-1-4673-2506-6
Electronic_ISBN :
978-1-4673-2505-9
Type :
conf
DOI :
10.1109/ISCSLP.2012.6423548
Filename :
6423548