Regional Information Center for Science and Technology (RICeST) - Collecting sentences from web resources for constructing spontaneous Chinese language model

DocumentCode :

3125588

Title :

Collecting sentences from web resources for constructing spontaneous Chinese language model

Author :

Xinhui Hu ; Youzheng Wu ; Matsuda, Shodai ; Hori, Chiori ; Kashioka, Hideki

Author_Institution :

Spoken Language Commun. Lab., Nat. Inst. of Inf. & Commun. Technol. (NICT), Kyoto, Japan

fYear :

2012

fDate :

5-8 Dec. 2012

Firstpage :

197

Lastpage :

200

Abstract :

In this paper, we present our work on collecting spontaneous texts from the Web for constructing a language model in a Chinese speech recognition system. The selection of spontaneous-like texts involves two steps. First, word-segmented web texts are selected using a perplexity-based approach in which style-related words are emphasized by omitting infrequent topic words from the similarity measurements. Second, the selected texts are clustered based on non-noun part-of-speech (POS) words, and optimal clusters are chosen by referring to a set of spontaneous seed sentences. Using a language model obtained by interpolating a model trained on the selected sentences with a baseline model, speech recognition evaluations were conducted on an open-domain spontaneous test set. The character error rate (CER) was reduced by 1.64% absolute (6.5% relative) compared with the baseline model. We also verified that the proposed method is superior to the conventional perplexity-based approach, with a CER reduction of about 1% absolute (4.0% relative).
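
The selection and interpolation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a toy add-alpha unigram model in place of the n-gram models a real recognizer would use, and the function names (train_unigram, perplexity, select_by_perplexity, interpolate), the skip parameter standing in for the omitted topic words, and the threshold value are all hypothetical. The POS-based clustering step is not shown.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (toy stand-in)."""
    counts = Counter(w for s in sentences for w in s if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def perplexity(model, sentence, skip=frozenset()):
    """Perplexity of a word-segmented sentence under `model`.

    Words listed in `skip` (an assumed stand-in for infrequent topic words)
    and out-of-vocabulary words are left out of the measurement.
    """
    words = [w for w in sentence if w in model and w not in skip]
    if not words:
        return float("inf")
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))


def select_by_perplexity(model, candidates, threshold, skip=frozenset()):
    """Keep the candidate sentences whose perplexity is at most `threshold`."""
    return [s for s in candidates if perplexity(model, s, skip) <= threshold]


def interpolate(p_selected, p_baseline, lam):
    """Linear interpolation of two models defined over the same vocabulary."""
    return {w: lam * p_selected[w] + (1.0 - lam) * p_baseline[w]
            for w in p_selected}


# Hypothetical toy data: spontaneous-style seed sentences and web candidates.
vocab = {"嗯", "我", "觉得", "吧", "经济", "政策"}
seed = [["嗯", "我", "觉得", "吧"], ["我", "觉得", "吧"]]
web = [["我", "觉得", "吧"], ["经济", "政策"]]

seed_model = train_unigram(seed, vocab)
selected = select_by_perplexity(seed_model, web, threshold=6.0)

baseline_model = train_unigram([["经济", "政策"]], vocab)
selected_model = train_unigram(selected, vocab)
mixed = interpolate(selected_model, baseline_model, lam=0.5)
```

In practice the interpolation weight would be tuned on held-out spontaneous data rather than fixed at 0.5 as in this sketch.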

Keywords :

Internet; interpolation; natural language processing; pattern clustering; speech recognition; statistical analysis; text analysis; word processing; CER; Chinese speech recognition system; POS; Web resources; baseline model; character error rate; language model interpolation; nonnoun part-of-speech word; open domain spontaneous test set; optimal clusters; perplexity-based approach; sentence collection; similarity measurement; speech recognition evaluations; spontaneous Chinese language model construction; spontaneous seed sentences; spontaneous text collection; spontaneous-like text selection; style-related words; text clustering; word-segmented Web text; Adaptation models; Data models; Error analysis; Speech; Speech recognition; Training; Vocabulary; Text collection; language model; spontaneous speech recognition; web data;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on

Conference_Location :

Kowloon

Print_ISBN :

978-1-4673-2506-6

Electronic_ISBN :

978-1-4673-2505-9

Type :

conf

DOI :

10.1109/ISCSLP.2012.6423548

Filename :

6423548

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3125588